FAD - FUNDAMENTALS OF DATA ANALYSIS

1. Introduction

The assignment consists of preparing and presenting a report on a Data Science project, applying the techniques learned during the course to the selected dataset.

1.1. Programming language and version-control tool used

The group chose to work in the R language (RStudio version 1.4.1717) and to use GitHub as the version-control tool. The project “/practica_ml” was created by Isabela Ignacio (user IsaPires1329) and shared with the remaining group members, Luisa Yánez (user lyanezgu) and Miguel García (user mgarciasanc2021).

Project link on GitHub: https://github.com/IsaPires1329/practica_ml.git.

1.2. R packages used

library(formatR)
library(readr)
library(ggplot2)
library(GGally)
library(dplyr)
library(tidyr)
library(missForest)
library(VIM)
library(formattable)
library(usmap)
library(cowplot)
library(corrplot)
library(MASS)
library(ggfortify)
library(nortest)
library(car)
library(lmtest)
library(PerformanceAnalytics)
library(Amelia)
library(ggthemes)
library(tidyverse)
library(tibble)
library(gridExtra)
library(ggbiplot)
library(factoextra)
library(caret)
library(ISLR)
library(rpart)
library(rpart.plot)
library(rattle)
library(tsne)
library(Rtsne)
library(class)
library(ada)
library(cluster)
library(useful)
library(mgcv)
library(xgboost)
library(randomForest)
library(kernlab)
library(pROC)
library(doMC)
library(ggpubr)

2. Chosen dataset

The dataset chosen by the group is called “Red Wine Quality” and contains information on red-wine variants of the Portuguese “Vinho Verde”, analysed and described through their physicochemical and sensory characteristics.

Dataset link: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009.

2.1. Loading the data

The “Red Wine Quality” dataset contains 12 columns and 1599 rows and is provided in .CSV format.

The data were initially stored in a data frame called “red_wine”, and an initial inspection of its contents was carried out with the head and summary functions.

red_wine <- read_csv("winequality-red.csv")
head(red_wine)
## # A tibble: 6 x 12
##   `fixed acidity` `volatile acidity` `citric acid` `residual sugar` chlorides
##             <dbl>              <dbl>         <dbl>            <dbl>     <dbl>
## 1             7.4               0.7           0                 1.9     0.076
## 2             7.8               0.88          0                 2.6     0.098
## 3             7.8               0.76          0.04              2.3     0.092
## 4            11.2               0.28          0.56              1.9     0.075
## 5             7.4               0.7           0                 1.9     0.076
## 6             7.4               0.66          0                 1.8     0.075
## # ... with 7 more variables: free sulfur dioxide <dbl>,
## #   total sulfur dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## #   alcohol <dbl>, quality <dbl>
summary(red_wine)
##  fixed acidity   volatile acidity  citric acid    residual sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free sulfur dioxide total sulfur dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000

2.2. Definition of the variables in the study data

Starting the initial analysis of our dataset, the 12 variables that make up the data can be described as follows:

Input variables (predictors, based on physicochemical tests):

  • fixed acidity: Most acids involved with wine are either fixed or non-volatile (they do not evaporate readily).
  • volatile acidity: The amount of acetic acid in the wine, which at too-high levels can lead to an unpleasant vinegar taste.
  • citric acid: Found in small quantities, citric acid can add “freshness” and flavour to wines.
  • residual sugar: The amount of sugar remaining after fermentation stops; wines with less than 1 gram/litre are rare, and wines with more than 45 grams/litre are considered sweet.
  • chlorides: The amount of salt in the wine.
  • free sulfur dioxide: The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and the bisulfite ion; it prevents microbial growth and the oxidation of the wine.
  • total sulfur dioxide: The amount of free and bound forms of SO2; at low concentrations SO2 is mostly undetectable in wine, but at free SO2 concentrations above 50 ppm it becomes evident in the nose and taste of the wine.
  • density: The density is close to that of water, depending on the alcohol and sugar content.
  • pH: Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines lie between 3 and 4 on the pH scale.
  • sulphates: A wine additive that can contribute to sulfur dioxide (SO2) gas levels, acting as an antimicrobial and antioxidant.
  • alcohol: The alcohol content of the wine (%).

Output variable (response/target, based on sensory data):

  • quality: Output variable (based on sensory data; score between 0 and 10).

2.3. Definition of the objectives

The final objective of the project is to arrive at a model that predicts the quality of red wine of the Portuguese “Vinho Verde” variety and tells us whether we are dealing with recommendable wines (passed/drinkable) or non-recommendable wines that should be avoided (failed/undrinkable).

2.4. Initial cleaning of the dataset

2.4.1. Renaming the columns

We decided to rename the variables in the data columns so that they follow a single pattern and, at the same time, avoid spaces that could cause problems later on.

names(red_wine) <- c("fixed_acidity", "volatile_acidity", "citric_acid",
    "residual_sugar", "chlorides", "free_sulfur_dioxide", "total_sulfur_dioxide",
    "density", "pH", "sulphates", "alcohol", "quality")
head(red_wine)
## # A tibble: 6 x 12
##   fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
##           <dbl>            <dbl>       <dbl>          <dbl>     <dbl>
## 1           7.4             0.7         0               1.9     0.076
## 2           7.8             0.88        0               2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.7         0               1.9     0.076
## 6           7.4             0.66        0               1.8     0.075
## # ... with 7 more variables: free_sulfur_dioxide <dbl>,
## #   total_sulfur_dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## #   alcohol <dbl>, quality <dbl>

2.4.2. Changing variable types

All the input variables in the dataset are numeric, and we understand that, in principle, they require no transformation in that respect.

One possibility would be to transform the “quality” (output) variable into a factor, making it categorical according to wine quality (wines are rated with integers between 0 and 10). The wine could then be categorised as “bad”, “normal” and “good”, as “passed” or “failed”, or into the ten predefined numeric categories.
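
As an illustration, a minimal sketch of the binary recoding (the cut point of 6 is hypothetical, and this recoding is not applied in what follows):

# Hypothetical recoding of quality into passed/failed classes (not applied)
red_wine_cat <- red_wine %>%
    mutate(quality_cat = factor(ifelse(quality >= 6, "passed", "failed")))
table(red_wine_cat$quality_cat)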

str(red_wine)
## spec_tbl_df [1,599 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ fixed_acidity       : num [1:1599] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile_acidity    : num [1:1599] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric_acid         : num [1:1599] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual_sugar      : num [1:1599] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num [1:1599] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free_sulfur_dioxide : num [1:1599] 11 25 15 17 11 13 15 15 9 17 ...
##  $ total_sulfur_dioxide: num [1:1599] 34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num [1:1599] 0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num [1:1599] 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num [1:1599] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num [1:1599] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : num [1:1599] 5 5 5 6 5 5 5 7 7 5 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `fixed acidity` = col_double(),
##   ..   `volatile acidity` = col_double(),
##   ..   `citric acid` = col_double(),
##   ..   `residual sugar` = col_double(),
##   ..   chlorides = col_double(),
##   ..   `free sulfur dioxide` = col_double(),
##   ..   `total sulfur dioxide` = col_double(),
##   ..   density = col_double(),
##   ..   pH = col_double(),
##   ..   sulphates = col_double(),
##   ..   alcohol = col_double(),
##   ..   quality = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

3. Adding missing data to the dataset

Using the summary function, we begin by verifying that there are no missing values in the dataset. The group therefore had to add them manually, to approximate a more realistic case, where missing values are the norm and have to be dealt with.

The missing values were injected only into columns that we believe will not be central to the analysis in this study (pH and sulphates), so that our predictions remain as accurate as possible.

We use the missForest library and set a seed so that the result is always the same.

red_wine
## # A tibble: 1,599 x 12
##    fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
##            <dbl>            <dbl>       <dbl>          <dbl>     <dbl>
##  1           7.4             0.7         0               1.9     0.076
##  2           7.8             0.88        0               2.6     0.098
##  3           7.8             0.76        0.04            2.3     0.092
##  4          11.2             0.28        0.56            1.9     0.075
##  5           7.4             0.7         0               1.9     0.076
##  6           7.4             0.66        0               1.8     0.075
##  7           7.9             0.6         0.06            1.6     0.069
##  8           7.3             0.65        0               1.2     0.065
##  9           7.8             0.58        0.02            2       0.073
## 10           7.5             0.5         0.36            6.1     0.071
## # ... with 1,589 more rows, and 7 more variables: free_sulfur_dioxide <dbl>,
## #   total_sulfur_dioxide <dbl>, density <dbl>, pH <dbl>, sulphates <dbl>,
## #   alcohol <dbl>, quality <dbl>
set.seed(101)
red_wine <- bind_cols(red_wine[c(1, 2, 3, 4, 5, 6, 7, 8, 11,
    12)], missForest::prodNA(red_wine[c(-1, -2, -3, -4, -5, -6,
    -7, -8, -11, -12)], noNA = 0.1))
red_wine
## # A tibble: 1,599 x 12
##    fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
##            <dbl>            <dbl>       <dbl>          <dbl>     <dbl>
##  1           7.4             0.7         0               1.9     0.076
##  2           7.8             0.88        0               2.6     0.098
##  3           7.8             0.76        0.04            2.3     0.092
##  4          11.2             0.28        0.56            1.9     0.075
##  5           7.4             0.7         0               1.9     0.076
##  6           7.4             0.66        0               1.8     0.075
##  7           7.9             0.6         0.06            1.6     0.069
##  8           7.3             0.65        0               1.2     0.065
##  9           7.8             0.58        0.02            2       0.073
## 10           7.5             0.5         0.36            6.1     0.071
## # ... with 1,589 more rows, and 7 more variables: free_sulfur_dioxide <dbl>,
## #   total_sulfur_dioxide <dbl>, density <dbl>, alcohol <dbl>, quality <dbl>,
## #   pH <dbl>, sulphates <dbl>

Using the VIM and Amelia libraries, we analyse the structure of the missing data within our dataset, to see and understand how they are distributed and which variables they affect.

We can verify that the proportion of missing data in these variables is approximately 10%, and that there are 15 rows in which both variables are missing.

summary(aggr(red_wine, numbers = T, sortVar = T))

## 
##  Variables sorted by number of missings: 
##              Variable      Count
##                    pH 0.10318949
##             sulphates 0.09631019
##         fixed_acidity 0.00000000
##      volatile_acidity 0.00000000
##           citric_acid 0.00000000
##        residual_sugar 0.00000000
##             chlorides 0.00000000
##   free_sulfur_dioxide 0.00000000
##  total_sulfur_dioxide 0.00000000
##               density 0.00000000
##               alcohol 0.00000000
##               quality 0.00000000
## 
##  Missings per variable: 
##              Variable Count
##         fixed_acidity     0
##      volatile_acidity     0
##           citric_acid     0
##        residual_sugar     0
##             chlorides     0
##   free_sulfur_dioxide     0
##  total_sulfur_dioxide     0
##               density     0
##               alcohol     0
##               quality     0
##                    pH   165
##             sulphates   154
## 
##  Missings in combinations of variables: 
##             Combinations Count    Percent
##  0:0:0:0:0:0:0:0:0:0:0:0  1295 80.9881176
##  0:0:0:0:0:0:0:0:0:0:0:1   139  8.6929331
##  0:0:0:0:0:0:0:0:0:0:1:0   150  9.3808630
##  0:0:0:0:0:0:0:0:0:0:1:1    15  0.9380863
missmap(red_wine, main = "Missing Map")

4. Splitting the dataset: training set and test set

Having taken a first look at the general structure of the data and added the missing values we needed, we now split the dataset in two, separating the data we will use for training from the data we will use for testing (given the amount of data available, the chosen split was 20% test and 80% training). We set a seed so that the split is reproducible and the distribution of the data is always the same.

We also save the test partition for later use in validating the final model, and from here on we work with the training partition.

set.seed(101)
sample <- sample.int(n = nrow(red_wine), size = floor(0.8 * nrow(red_wine)),
    replace = F)
train <- red_wine[sample, ]
test <- red_wine[-sample, ]
train
## # A tibble: 1,279 x 12
##    fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
##            <dbl>            <dbl>       <dbl>          <dbl>     <dbl>
##  1           7.1            0.48         0.28            2.8     0.068
##  2           7.6            0.49         0.33            1.9     0.074
##  3           5              1.02         0.04            1.4     0.045
##  4           7.6            0.43         0.29            2.1     0.075
##  5           6.8            0.59         0.1             1.7     0.063
##  6           6.8            0.815        0               1.2     0.267
##  7           8.5            0.21         0.52            1.9     0.09 
##  8           7.4            0.36         0.29            2.6     0.087
##  9           5.5            0.49         0.03            1.8     0.044
## 10           6.8            0.49         0.22            2.3     0.071
## # ... with 1,269 more rows, and 7 more variables: free_sulfur_dioxide <dbl>,
## #   total_sulfur_dioxide <dbl>, density <dbl>, alcohol <dbl>, quality <dbl>,
## #   pH <dbl>, sulphates <dbl>
test
## # A tibble: 320 x 12
##    fixed_acidity volatile_acidity citric_acid residual_sugar chlorides
##            <dbl>            <dbl>       <dbl>          <dbl>     <dbl>
##  1           7.4             0.7         0               1.9     0.076
##  2           7.3             0.65        0               1.2     0.065
##  3           8.9             0.22        0.48            1.8     0.077
##  4           7.6             0.41        0.24            1.8     0.08 
##  5           7.1             0.71        0               1.9     0.08 
##  6           5.7             1.13        0.09            1.5     0.172
##  7           7.3             0.45        0.36            5.9     0.074
##  8           8.1             0.66        0.22            2.2     0.069
##  9           6.8             0.67        0.02            1.8     0.05 
## 10           5.6             0.31        0.37            1.4     0.074
## # ... with 310 more rows, and 7 more variables: free_sulfur_dioxide <dbl>,
## #   total_sulfur_dioxide <dbl>, density <dbl>, alcohol <dbl>, quality <dbl>,
## #   pH <dbl>, sulphates <dbl>
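
As an aside, given the imbalance in the quality scores, a split stratified on quality could be considered instead (a sketch using the already loaded caret package; this is not the split used above):

# Stratified split sketch: createDataPartition balances the quality groups
set.seed(101)
idx <- as.vector(caret::createDataPartition(red_wine$quality, p = 0.8,
    list = FALSE))
train_strat <- red_wine[idx, ]
test_strat <- red_wine[-idx, ]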

5. Detection, treatment and imputation of missing data

To impute the missing data in the “pH” and “sulphates” columns, we decided to replace all their NAs with the median values of those same variables.

With the summary function, we verify that there are no missing values left in the train dataset.

train$pH[is.na(train$pH)] <- median(train$pH, na.rm = TRUE)
train$sulphates[is.na(train$sulphates)] <- median(train$sulphates,
    na.rm = TRUE)
summary(train)
##  fixed_acidity    volatile_acidity  citric_acid     residual_sugar  
##  Min.   : 4.600   Min.   :0.1200   Min.   :0.0000   Min.   : 0.900  
##  1st Qu.: 7.100   1st Qu.:0.3900   1st Qu.:0.1000   1st Qu.: 1.900  
##  Median : 7.900   Median :0.5200   Median :0.2600   Median : 2.200  
##  Mean   : 8.357   Mean   :0.5262   Mean   :0.2732   Mean   : 2.552  
##  3rd Qu.: 9.300   3rd Qu.:0.6300   3rd Qu.:0.4300   3rd Qu.: 2.600  
##  Max.   :15.900   Max.   :1.5800   Max.   :0.7900   Max.   :15.500  
##    chlorides      free_sulfur_dioxide total_sulfur_dioxide    density      
##  Min.   :0.0120   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.0710   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.0800   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.0882   Mean   :15.86       Mean   : 46.44       Mean   :0.9968  
##  3rd Qu.:0.0910   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9979  
##  Max.   :0.6110   Max.   :68.00       Max.   :289.00       Max.   :1.0037  
##     alcohol         quality            pH          sulphates     
##  Min.   : 8.40   Min.   :3.000   Min.   :2.860   Min.   :0.3300  
##  1st Qu.: 9.50   1st Qu.:5.000   1st Qu.:3.220   1st Qu.:0.5600  
##  Median :10.20   Median :6.000   Median :3.300   Median :0.6200  
##  Mean   :10.43   Mean   :5.635   Mean   :3.308   Mean   :0.6526  
##  3rd Qu.:11.10   3rd Qu.:6.000   3rd Qu.:3.380   3rd Qu.:0.7100  
##  Max.   :14.90   Max.   :8.000   Max.   :4.010   Max.   :1.9500
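
An alternative to the median fill (a sketch, not applied here) would be k-nearest-neighbours imputation from the already loaded VIM package, which fills each gap from similar wines; it would have to be run on train before the median replacement above:

# kNN imputation sketch: fills pH/sulphates NAs from the 5 nearest wines
train_knn <- VIM::kNN(train, variable = c("pH", "sulphates"), k = 5,
    imp_var = FALSE)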

6. EDA - Exploratory data analysis

6.1. Analysis of the variable distributions

We analyse how the different variables in our dataset are distributed.

train %>%
    keep(is.numeric) %>%
    gather() %>%
    ggplot(aes(value, fill = key)) + facet_wrap(~key, scales = "free") +
    geom_histogram(bins = sqrt(nrow(train))) + theme(legend.position = "none")

From the plots we can see that some of the variables are normally distributed, while others are right-skewed.

The distributions of “fixed_acidity” and “volatile_acidity” are very similar, which suggests certain similarities between these two physicochemical indicators.

The “density” and “pH” variables are normally distributed, indicating that red wines differ little on these two indicators. No transformation of their distribution is therefore required.
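
These visual impressions can be checked numerically with the sample skewness (a minimal sketch; values near 0 suggest symmetry, large positive values a pronounced right skew):

# Sample skewness: third standardised moment (rough check, no extra package)
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
sapply(train[c("density", "pH", "residual_sugar", "chlorides",
    "sulphates")], skew)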

The variables “residual_sugar”, “chlorides”, “free_sulfur_dioxide”, “total_sulfur_dioxide” and “sulphates” are heavily skewed, so it is advisable to transform them to make the distribution of their values more homogeneous. This is achieved by applying a logarithmic transformation, which normalises their distributions:

train <- train %>%
    mutate(Log_residual_sugar = log(residual_sugar), Log_chlorides = log(chlorides),
        Log_free_sulfur_dioxide = log(free_sulfur_dioxide), Log_total_sulfur_dioxide = log(total_sulfur_dioxide),
        Log_sulphates = log(sulphates))

ga <- train %>%
    ggplot(aes(x = Log_residual_sugar)) + geom_histogram(bins = 20,
    fill = "#619CFF")
gb <- train %>%
    ggplot(aes(x = Log_chlorides)) + geom_histogram(bins = 20,
    fill = "#E58700")
gc <- train %>%
    ggplot(aes(x = Log_free_sulfur_dioxide)) + geom_histogram(bins = 20,
    fill = "#00BF7D")
gd <- train %>%
    ggplot(aes(x = Log_total_sulfur_dioxide)) + geom_histogram(bins = 20,
    fill = "#FD61D1")
ge <- train %>%
    ggplot(aes(x = Log_sulphates)) + geom_histogram(bins = 20,
    fill = "#B983FF")

grid.arrange(ga, gb, gc, gd, ge)

We modify our original dataset so that the log-transformed variables replace their untransformed counterparts. This again gives a dataset with 12 variables, but with 5 of them log-transformed.

train <- train %>%
    dplyr::select(-residual_sugar, -chlorides, -free_sulfur_dioxide,
        -total_sulfur_dioxide, -sulphates)
train %>%
    summary
##  fixed_acidity    volatile_acidity  citric_acid        density      
##  Min.   : 4.600   Min.   :0.1200   Min.   :0.0000   Min.   :0.9901  
##  1st Qu.: 7.100   1st Qu.:0.3900   1st Qu.:0.1000   1st Qu.:0.9956  
##  Median : 7.900   Median :0.5200   Median :0.2600   Median :0.9968  
##  Mean   : 8.357   Mean   :0.5262   Mean   :0.2732   Mean   :0.9968  
##  3rd Qu.: 9.300   3rd Qu.:0.6300   3rd Qu.:0.4300   3rd Qu.:0.9979  
##  Max.   :15.900   Max.   :1.5800   Max.   :0.7900   Max.   :1.0037  
##     alcohol         quality            pH        Log_residual_sugar
##  Min.   : 8.40   Min.   :3.000   Min.   :2.860   Min.   :-0.1054   
##  1st Qu.: 9.50   1st Qu.:5.000   1st Qu.:3.220   1st Qu.: 0.6419   
##  Median :10.20   Median :6.000   Median :3.300   Median : 0.7885   
##  Mean   :10.43   Mean   :5.635   Mean   :3.308   Mean   : 0.8554   
##  3rd Qu.:11.10   3rd Qu.:6.000   3rd Qu.:3.380   3rd Qu.: 0.9555   
##  Max.   :14.90   Max.   :8.000   Max.   :4.010   Max.   : 2.7408   
##  Log_chlorides     Log_free_sulfur_dioxide Log_total_sulfur_dioxide
##  Min.   :-4.4228   Min.   :0.000           Min.   :1.792           
##  1st Qu.:-2.6451   1st Qu.:1.946           1st Qu.:3.091           
##  Median :-2.5257   Median :2.639           Median :3.638           
##  Mean   :-2.4980   Mean   :2.545           Mean   :3.604           
##  3rd Qu.:-2.3969   3rd Qu.:3.045           3rd Qu.:4.127           
##  Max.   :-0.4927   Max.   :4.220           Max.   :5.666           
##  Log_sulphates    
##  Min.   :-1.1087  
##  1st Qu.:-0.5798  
##  Median :-0.4780  
##  Mean   :-0.4496  
##  3rd Qu.:-0.3425  
##  Max.   : 0.6678

Having applied the appropriate log transformations to the 5 variables that required them, we again look at the overall distributions of the full set of variables:

train %>%
    keep(is.numeric) %>%
    gather() %>%
    ggplot(aes(value, fill = key)) + facet_wrap(~key, scales = "free") +
    geom_histogram(bins = sqrt(nrow(train))) + theme(legend.position = "none")

We analyse in detail the distribution of the output variable “quality”, the wines’ quality scores between 0 and 10.

ggplot(data = train) + geom_bar(mapping = aes(x = quality, fill = as.factor(quality))) +
    labs(title = "Wine quality histogram")

table(train$quality)
## 
##   3   4   5   6   7   8 
##   7  38 552 513 156  13
prop.table(table(train$quality))
## 
##           3           4           5           6           7           8 
## 0.005473026 0.029710711 0.431587177 0.401094605 0.121970289 0.010164191

From the plot and the tables we can see that most of the wines (about 83% of them) are rated with quality values of 5 and 6, on a scale that runs from 0 to 10.
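
The quoted share can be verified directly from the proportions above (a one-line check):

# Share of wines rated 5 or 6 (roughly 83% of the training set)
sum(prop.table(table(train$quality))[c("5", "6")])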

6.2. Boxplots - analysis of the relevant variables and of the observed outliers

Using boxplots, we analyse whether our variables have outliers and examine their central values and spread across the quality levels.

Boxplot of alcohol

BoxPlot_alcohol <- ggplot(train, aes(x = factor(quality), y = alcohol)) +
    geom_boxplot(fill = "#F8766D") + ggtitle("Boxplot alcohol")
BoxPlot_alcohol

We can see that wines with better “quality” scores generally have a higher alcohol percentage.

Boxplot of citric_acid

BoxPlot_citric_acid <- ggplot(train, aes(x = factor(quality),
    y = citric_acid)) + geom_boxplot(fill = "#E58700") +
    ggtitle("Boxplot citric_acid")
BoxPlot_citric_acid

We can see that wines with better “quality” scores generally contain more citric acid.

Boxplot of density

BoxPlot_density <- ggplot(train, aes(x = factor(quality), y = density)) +
    geom_boxplot(fill = "#C99800") + ggtitle("Boxplot density")
BoxPlot_density

We can see that wines with better “quality” scores generally have a slightly lower density, but this variable is not decisive for product quality.

Boxplot of fixed_acidity

BoxPlot_fixed_acidity <- ggplot(train, aes(x = factor(quality),
    y = fixed_acidity)) + geom_boxplot(fill = "#6BB100") +
    ggtitle("Boxplot fixed_acidity")
BoxPlot_fixed_acidity

We can see that “fixed_acidity” stays fairly stable regardless of the wine’s final quality, with no large differences across the quality levels.

Boxplot of Log_chlorides

BoxPlot_Log_chlorides <- ggplot(train, aes(x = factor(quality),
    y = Log_chlorides)) + geom_boxplot(fill = "#00BA38") +
    ggtitle("Boxplot Log_chlorides")
BoxPlot_Log_chlorides

We can see that “Log_chlorides” stays fairly stable regardless of the wine’s final quality, with no large differences across the quality levels.

Boxplot of Log_free_sulfur_dioxide

BoxPlot_Log_free_sulfur_dioxide <- ggplot(train, aes(x = factor(quality),
    y = Log_free_sulfur_dioxide)) + geom_boxplot(fill = "#00BF7D") +
    ggtitle("Boxplot Log_free_sulfur_dioxide")
BoxPlot_Log_free_sulfur_dioxide

No specific trend is apparent in “Log_free_sulfur_dioxide” that would be decisive for wine quality.

Boxplot of Log_residual_sugar

BoxPlot_Log_residual_sugar <- ggplot(train, aes(x = factor(quality),
    y = Log_residual_sugar)) + geom_boxplot(fill = "#00C0AF") +
    ggtitle("Boxplot Log_residual_sugar")
BoxPlot_Log_residual_sugar

We can see that “Log_residual_sugar” stays fairly stable regardless of the wine’s final quality, with no large differences across the quality levels.

Boxplot of Log_sulphates

BoxPlot_Log_sulphates <- ggplot(train, aes(x = factor(quality),
    y = Log_sulphates)) + geom_boxplot(fill = "#00BCD8") +
    ggtitle("Boxplot Log_sulphates")
BoxPlot_Log_sulphates

We can see that wines with better “quality” scores generally have higher “Log_sulphates” values, although there are quite a few outliers at scores of 5 and 6 that could be misleading.

Boxplot of Log_total_sulfur_dioxide

BoxPlot_Log_total_sulfur_dioxide <- ggplot(train, aes(x = factor(quality),
    y = Log_total_sulfur_dioxide)) + geom_boxplot(fill = "#00B0F6") +
    ggtitle("Boxplot Log_total_sulfur_dioxide")
BoxPlot_Log_total_sulfur_dioxide

No specific trend is apparent in “Log_total_sulfur_dioxide” that would be decisive for wine quality.

Boxplot of pH

BoxPlot_pH <- ggplot(train, aes(x = factor(quality), y = pH)) +
    geom_boxplot(fill = "#B983FF") + ggtitle("Boxplot pH")
BoxPlot_pH

We can see that wines with better “quality” scores generally have a slightly lower pH, although there are numerous outliers among wines scored 5 and 6 that could be misleading.

Boxplot of volatile_acidity

BoxPlot_volatile_acidity <- ggplot(train, aes(x = factor(quality),
    y = volatile_acidity)) + geom_boxplot(fill = "#FF67A4") +
    ggtitle("Boxplot volatile_acidity")
BoxPlot_volatile_acidity

We can see that wines with better “quality” scores generally have lower “volatile_acidity” (acetic acid).

6.3. Correlation between variables

Continuing the analysis of the dataset’s variables and the study of how they relate to one another, we want a global view of how the numeric variables that may feed the prediction model are correlated.

6.3.1. Analysis of the global correlation of the set of variables

pairs(train)

corrplot(cor(train %>%
    mutate(quality = as.numeric(quality)) %>%
    keep(is.numeric)))

res <- cor(train %>%
    mutate(quality = as.numeric(quality)) %>%
    keep(is.numeric))
round(res, 2)
##                          fixed_acidity volatile_acidity citric_acid density
## fixed_acidity                     1.00            -0.26        0.68    0.68
## volatile_acidity                 -0.26             1.00       -0.55    0.02
## citric_acid                       0.68            -0.55        1.00    0.37
## density                           0.68             0.02        0.37    1.00
## alcohol                          -0.05            -0.21        0.15   -0.49
## quality                           0.14            -0.39        0.25   -0.16
## pH                               -0.64             0.22       -0.49   -0.32
## Log_residual_sugar                0.20             0.04        0.19    0.44
## Log_chlorides                     0.16             0.09        0.16    0.33
## Log_free_sulfur_dioxide          -0.18             0.03       -0.11   -0.04
## Log_total_sulfur_dioxide         -0.12             0.08       -0.03    0.11
## Log_sulphates                     0.19            -0.30        0.32    0.14
##                          alcohol quality    pH Log_residual_sugar Log_chlorides
## fixed_acidity              -0.05    0.14 -0.64               0.20          0.16
## volatile_acidity           -0.21   -0.39  0.22               0.04          0.09
## citric_acid                 0.15    0.25 -0.49               0.19          0.16
## density                    -0.49   -0.16 -0.32               0.44          0.33
## alcohol                     1.00    0.49  0.18               0.06         -0.29
## quality                     0.49    1.00 -0.07               0.03         -0.16
## pH                          0.18   -0.07  1.00              -0.10         -0.26
## Log_residual_sugar          0.06    0.03 -0.10               1.00          0.12
## Log_chlorides              -0.29   -0.16 -0.26               0.12          1.00
## Log_free_sulfur_dioxide    -0.09   -0.03  0.08               0.10         -0.02
## Log_total_sulfur_dioxide   -0.24   -0.16 -0.03               0.17          0.06
## Log_sulphates               0.13    0.33 -0.14               0.02          0.22
##                          Log_free_sulfur_dioxide Log_total_sulfur_dioxide
## fixed_acidity                              -0.18                    -0.12
## volatile_acidity                            0.03                     0.08
## citric_acid                                -0.11                    -0.03
## density                                    -0.04                     0.11
## alcohol                                    -0.09                    -0.24
## quality                                    -0.03                    -0.16
## pH                                          0.08                    -0.03
## Log_residual_sugar                          0.10                     0.17
## Log_chlorides                              -0.02                     0.06
## Log_free_sulfur_dioxide                     1.00                     0.79
## Log_total_sulfur_dioxide                    0.79                     1.00
## Log_sulphates                               0.06                     0.04
##                          Log_sulphates
## fixed_acidity                     0.19
## volatile_acidity                 -0.30
## citric_acid                       0.32
## density                           0.14
## alcohol                           0.13
## quality                           0.33
## pH                               -0.14
## Log_residual_sugar                0.02
## Log_chlorides                     0.22
## Log_free_sulfur_dioxide           0.06
## Log_total_sulfur_dioxide          0.04
## Log_sulphates                     1.00

We see that the variables most correlated with “quality” are: “volatile_acidity”, “citric_acid”, “alcohol” and “Log_sulphates”.
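
These can also be read off the res matrix computed above, ranked by absolute correlation with quality (a minimal sketch):

# Correlations with quality, strongest first (quality itself excluded)
cor_q <- res[, "quality"]
sort(abs(cor_q[names(cor_q) != "quality"]), decreasing = TRUE)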

6.3.2. Bivariate correlation analysis

We perform a bivariate analysis to see which variables are most correlated with each other, positively or negatively.

Correlation: fixed_acidity and citric_acid:

cor(x = train$fixed_acidity, y = train$citric_acid)
## [1] 0.678372
train %>%
    ggplot(aes(fixed_acidity, citric_acid)) + geom_point(alpha = 0.2,
    colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
    labs(title = "Relationship between fixed_acidity and citric_acid",
        x = "fixed_acidity", y = "citric_acid")

Correlation: fixed_acidity and density:

cor(x = train$fixed_acidity, y = train$density)
## [1] 0.6782196
train %>%
    ggplot(aes(fixed_acidity, density)) + geom_point(alpha = 0.2,
    colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
    labs(title = "Relationship between fixed_acidity and density",
        x = "fixed_acidity", y = "density")

Correlation: fixed_acidity and pH:

cor(x = train$fixed_acidity, y = train$pH)
## [1] -0.644656
train %>%
    ggplot(aes(fixed_acidity, pH)) + geom_point(alpha = 0.2,
    colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
    labs(title = "Relationship between fixed_acidity and pH",
        x = "fixed_acidity", y = "pH")

Correlation: citric_acid and volatile_acidity:

cor(x = train$citric_acid, y = train$volatile_acidity)
## [1] -0.5538307
train %>%
    ggplot(aes(citric_acid, volatile_acidity)) + geom_point(alpha = 0.2,
    colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
    labs(title = "Relationship between citric_acid and volatile_acidity",
        x = "citric_acid", y = "volatile_acidity")

Correlation: citric_acid and pH:

cor(x = train$citric_acid, y = train$pH)
## [1] -0.4941459
train %>%
    ggplot(aes(citric_acid, pH)) + geom_point(alpha = 0.2, colour = "green") +
    geom_smooth(formula = "y ~ x", method = "lm") + labs(title = "Relationship between citric_acid and pH",
    x = "citric_acid", y = "pH")

Correlation: density and Log_residual_sugar:

cor(x = train$density, y = train$Log_residual_sugar)
## [1] 0.4399375
train %>%
    ggplot(aes(density, Log_residual_sugar)) + geom_point(alpha = 0.2,
    colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
    labs(title = "Relationship between density and Log_residual_sugar",
        x = "density", y = "Log_residual_sugar")

Correlation: density and alcohol:

cor(x = train$density, y = train$alcohol)
## [1] -0.4880924
train %>%
    ggplot(aes(density, alcohol)) + geom_point(alpha = 0.2, colour = "green") +
    geom_smooth(formula = "y ~ x", method = "lm") + labs(title = "Relationship between density and alcohol",
    x = "density", y = "alcohol")

Correlation: quality and alcohol:

cor(x = train$quality, y = train$alcohol)
## [1] 0.4895963
train %>%
    ggplot(aes(quality, alcohol)) + geom_point(alpha = 0.2,
    colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
    labs(title = "Relationship between quality and alcohol",
        x = "quality", y = "alcohol")

Correlation: quality and volatile_acidity:

cor(x = train$quality, y = train$volatile_acidity)
## [1] -0.3904367
train %>%
    ggplot(aes(quality, volatile_acidity)) + geom_point(alpha = 0.2,
    colour = "green") + geom_smooth(formula = "y ~ x", method = "lm") +
    labs(title = "Relationship between quality and volatile_acidity",
        x = "quality", y = "volatile_acidity")

Correlation: Log_free_sulfur_dioxide and Log_total_sulfur_dioxide:

cor(x = train$Log_free_sulfur_dioxide, y = train$Log_total_sulfur_dioxide)
## [1] 0.7856495
train %>%
    ggplot(aes(Log_free_sulfur_dioxide, Log_total_sulfur_dioxide)) +
    geom_point(alpha = 0.2, colour = "green") + geom_smooth(formula = "y ~ x",
    method = "lm") + labs(title = "Relationship between Log_free_sulfur_dioxide and Log_total_sulfur_dioxide",
    x = "Log_free_sulfur_dioxide", y = "Log_total_sulfur_dioxide")

7. Changes, modifications and transformations on the test dataset

We now apply to the test dataset the changes and modifications previously applied to our training dataset.

We impute the NAs in the test dataset:

test$pH[is.na(test$pH)] <- median(test$pH, na.rm = TRUE)
test$sulphates[is.na(test$sulphates)] <- median(test$sulphates,
    na.rm = TRUE)
summary(test)
##  fixed_acidity   volatile_acidity  citric_acid    residual_sugar  
##  Min.   : 4.70   Min.   :0.1600   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.080   1st Qu.: 1.900  
##  Median : 7.80   Median :0.5200   Median :0.250   Median : 2.150  
##  Mean   : 8.17   Mean   :0.5341   Mean   :0.262   Mean   : 2.486  
##  3rd Qu.: 9.00   3rd Qu.:0.6600   3rd Qu.:0.420   3rd Qu.: 2.525  
##  Max.   :15.00   Max.   :1.2400   Max.   :1.000   Max.   :13.800  
##    chlorides       free_sulfur_dioxide total_sulfur_dioxide    density      
##  Min.   :0.01200   Min.   : 3.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.06800   1st Qu.: 7.00       1st Qu.: 20.00       1st Qu.:0.9957  
##  Median :0.07800   Median :14.00       Median : 37.00       Median :0.9967  
##  Mean   :0.08452   Mean   :15.95       Mean   : 46.58       Mean   :0.9967  
##  3rd Qu.:0.08725   3rd Qu.:22.00       3rd Qu.: 63.25       3rd Qu.:0.9977  
##  Max.   :0.61000   Max.   :72.00       Max.   :160.00       Max.   :1.0024  
##     alcohol         quality            pH          sulphates     
##  Min.   : 8.80   Min.   :3.000   Min.   :2.740   Min.   :0.3900  
##  1st Qu.: 9.50   1st Qu.:5.000   1st Qu.:3.210   1st Qu.:0.5500  
##  Median :10.10   Median :6.000   Median :3.320   Median :0.6200  
##  Mean   :10.39   Mean   :5.641   Mean   :3.315   Mean   :0.6583  
##  3rd Qu.:11.00   3rd Qu.:6.000   3rd Qu.:3.400   3rd Qu.:0.7100  
##  Max.   :14.00   Max.   :8.000   Max.   :3.850   Max.   :2.0000

We log-transform the same variables that were transformed in the training set:

test <- test %>%
    mutate(Log_residual_sugar = log(residual_sugar), Log_chlorides = log(chlorides),
        Log_free_sulfur_dioxide = log(free_sulfur_dioxide), Log_total_sulfur_dioxide = log(total_sulfur_dioxide),
        Log_sulphates = log(sulphates))

We modify our test dataset so that the log-transformed variables replace their untransformed counterparts. This again gives a dataset with 12 variables, 5 of them log-transformed.

test <- test %>%
    dplyr::select(-residual_sugar, -chlorides, -free_sulfur_dioxide,
        -total_sulfur_dioxide, -sulphates)
head(test)
## # A tibble: 6 x 12
##   fixed_acidity volatile_acidity citric_acid density alcohol quality    pH
##           <dbl>            <dbl>       <dbl>   <dbl>   <dbl>   <dbl> <dbl>
## 1           7.4             0.7         0      0.998     9.4       5  3.51
## 2           7.3             0.65        0      0.995    10         7  3.39
## 3           8.9             0.22        0.48   0.997     9.4       6  3.39
## 4           7.6             0.41        0.24   0.996     9.5       5  3.28
## 5           7.1             0.71        0      0.997     9.4       5  3.47
## 6           5.7             1.13        0.09   0.994     9.8       4  3.5 
## # ... with 5 more variables: Log_residual_sugar <dbl>, Log_chlorides <dbl>,
## #   Log_free_sulfur_dioxide <dbl>, Log_total_sulfur_dioxide <dbl>,
## #   Log_sulphates <dbl>

8. Multiple linear regression

Having analysed our dataset in depth and having understood and transformed our variables, we will now fit a multiple linear regression model that tries to predict the quality of red wine of the Portuguese “Vinho Verde” variety.

8.1. Fitting the regression model

We fit a multiple linear regression model to predict the value of the quality variable from the following independent variables (we take all variables except “Log_residual_sugar”, which shows no correlation with “quality”), selected on the basis of the correlation analyses and studies seen above.

modelo = lm(quality ~ alcohol + fixed_acidity + volatile_acidity +
    citric_acid + Log_chlorides + Log_total_sulfur_dioxide +
    Log_free_sulfur_dioxide + density + pH + Log_sulphates, data = train)

summary(modelo)
## 
## Call:
## lm(formula = quality ~ alcohol + fixed_acidity + volatile_acidity + 
##     citric_acid + Log_chlorides + Log_total_sulfur_dioxide + 
##     Log_free_sulfur_dioxide + density + pH + Log_sulphates, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.27885 -0.34712 -0.05254  0.43254  1.89518 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -9.94814   18.67185  -0.533 0.594274    
## alcohol                   0.30374    0.02413  12.587  < 2e-16 ***
## fixed_acidity             0.01981    0.02396   0.827 0.408630    
## volatile_acidity         -1.04091    0.12864  -8.092 1.37e-15 ***
## citric_acid              -0.31703    0.15340  -2.067 0.038963 *  
## Log_chlorides            -0.22535    0.06205  -3.632 0.000293 ***
## Log_total_sulfur_dioxide -0.15705    0.04408  -3.563 0.000380 ***
## Log_free_sulfur_dioxide   0.12800    0.04305   2.973 0.003001 ** 
## density                  14.65001   19.00095   0.771 0.440842    
## pH                       -0.50069    0.18785  -2.665 0.007788 ** 
## Log_sulphates             0.85751    0.09545   8.984  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6241 on 1268 degrees of freedom
## Multiple R-squared:  0.3862, Adjusted R-squared:  0.3813 
## F-statistic: 79.77 on 10 and 1268 DF,  p-value: < 2.2e-16

8.2. Variable selection for the regression model

For variable selection we use the automatic stepwise selection method.

empty.model <- lm(quality ~ 1, data = train)
horizonte <- formula(quality ~ alcohol + fixed_acidity + volatile_acidity +
    citric_acid + Log_chlorides + Log_total_sulfur_dioxide +
    Log_free_sulfur_dioxide + density + pH + Log_sulphates)
# stepwise selection method; reports which variables are
# significant
seleccion = stepAIC(empty.model, direction = c("both"), trace = FALSE,
    scope = horizonte)
seleccion$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## quality ~ 1
## 
## Final Model:
## quality ~ alcohol + volatile_acidity + Log_sulphates + Log_chlorides + 
##     pH + Log_total_sulfur_dioxide + Log_free_sulfur_dioxide
## 
## 
##                         Step Df   Deviance Resid. Df Resid. Dev        AIC
## 1                                               1278   804.4848  -590.9851
## 2                  + alcohol  1 192.838647      1277   611.6461  -939.4926
## 3         + volatile_acidity  1  69.859784      1276   541.7863 -1092.6125
## 4            + Log_sulphates  1  29.717001      1275   512.0693 -1162.7631
## 5            + Log_chlorides  1   4.327006      1274   507.7423 -1171.6166
## 6                       + pH  1   5.098410      1273   502.6439 -1182.5244
## 7 + Log_total_sulfur_dioxide  1   2.458061      1272   500.1858 -1186.7944
## 8  + Log_free_sulfur_dioxide  1   3.993776      1271   496.1921 -1195.0476

We inspect the information for the model chosen as “best”:

summary(seleccion)
## 
## Call:
## lm(formula = quality ~ alcohol + volatile_acidity + Log_sulphates + 
##     Log_chlorides + pH + Log_total_sulfur_dioxide + Log_free_sulfur_dioxide, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.24223 -0.35766 -0.05925  0.43097  1.88984 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               4.97311    0.44122  11.271  < 2e-16 ***
## alcohol                   0.28289    0.01841  15.366  < 2e-16 ***
## volatile_acidity         -0.90819    0.10940  -8.301 2.60e-16 ***
## Log_sulphates             0.86434    0.09387   9.208  < 2e-16 ***
## Log_chlorides            -0.23817    0.05970  -3.989 7.00e-05 ***
## pH                       -0.52666    0.13211  -3.987 7.08e-05 ***
## Log_total_sulfur_dioxide -0.17181    0.04226  -4.065 5.10e-05 ***
## Log_free_sulfur_dioxide   0.13511    0.04224   3.198  0.00142 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6248 on 1271 degrees of freedom
## Multiple R-squared:  0.3832, Adjusted R-squared:  0.3798 
## F-statistic: 112.8 on 7 and 1271 DF,  p-value: < 2.2e-16

We keep the model selected as the best for the regression. Note that, relative to the stepwise result above, the final fit below adds citric_acid and fixed_acidity and drops Log_free_sulfur_dioxide.

mejor_modelo = lm(quality ~ alcohol + volatile_acidity + Log_sulphates +
    Log_chlorides + pH + Log_total_sulfur_dioxide + citric_acid +
    fixed_acidity, data = train)

summary(mejor_modelo)
## 
## Call:
## lm(formula = quality ~ alcohol + volatile_acidity + Log_sulphates + 
##     Log_chlorides + pH + Log_total_sulfur_dioxide + citric_acid + 
##     fixed_acidity, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.31169 -0.35282 -0.05415  0.42911  1.86472 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               4.27164    0.63721   6.704 3.05e-11 ***
## alcohol                   0.30000    0.01884  15.921  < 2e-16 ***
## volatile_acidity         -1.07446    0.12691  -8.466  < 2e-16 ***
## Log_sulphates             0.88622    0.09458   9.370  < 2e-16 ***
## Log_chlorides            -0.21728    0.06080  -3.574 0.000365 ***
## pH                       -0.40708    0.16560  -2.458 0.014097 *  
## Log_total_sulfur_dioxide -0.05076    0.02651  -1.915 0.055768 .  
## citric_acid              -0.38575    0.15186  -2.540 0.011197 *  
## fixed_acidity             0.03469    0.01610   2.155 0.031357 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6258 on 1270 degrees of freedom
## Multiple R-squared:  0.3818, Adjusted R-squared:  0.3779 
## F-statistic: 98.03 on 8 and 1270 DF,  p-value: < 2.2e-16

8.2.1. Confidence intervals

We compute 95% confidence intervals for the fitted values of our observations.

intervalos = predict(mejor_modelo, interval = "confidence", level = 0.95)
head(intervalos)
##        fit      lwr      upr
## 1 5.545952 5.455099 5.636806
## 2 5.095579 5.013600 5.177558
## 3 4.981789 4.813456 5.150122
## 4 5.381935 5.312809 5.451061
## 5 5.445889 5.371787 5.519992
## 6 4.747729 4.589377 4.906082
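
For predicting individual wines rather than the mean response, prediction intervals, which also include the residual noise and are therefore wider, can be obtained analogously (a sketch):

# Prediction intervals for individual observations (wider than the above)
intervalos_pred <- predict(mejor_modelo, interval = "prediction",
    level = 0.95)
head(intervalos_pred)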

8.2.2. ANOVA table

The ANOVA table shows the significance of the regression terms.

anova = aov(mejor_modelo)
summary(anova)
##                            Df Sum Sq Mean Sq F value   Pr(>F)    
## alcohol                     1  192.8  192.84 492.415  < 2e-16 ***
## volatile_acidity            1   69.9   69.86 178.387  < 2e-16 ***
## Log_sulphates               1   29.7   29.72  75.883  < 2e-16 ***
## Log_chlorides               1    4.3    4.33  11.049 0.000913 ***
## pH                          1    5.1    5.10  13.019 0.000320 ***
## Log_total_sulfur_dioxide    1    2.5    2.46   6.277 0.012358 *  
## citric_acid                 1    1.0    1.01   2.584 0.108223    
## fixed_acidity               1    1.8    1.82   4.644 0.031357 *  
## Residuals                1270  497.4    0.39                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

8.3. Model diagnostics: checking the following assumptions

8.3.1. Multicollinearity

vif(mejor_modelo)
##                  alcohol         volatile_acidity            Log_sulphates 
##                 1.326586                 1.667617                 1.236976 
##            Log_chlorides                       pH Log_total_sulfur_dioxide 
##                 1.301107                 1.887347                 1.125057 
##              citric_acid            fixed_acidity 
##                 2.892183                 2.660426
mean(vif(mejor_modelo))
## [1] 1.762162

Generally, a VIF above 4 or a tolerance below 0.25 indicates possible multicollinearity (strong correlation between the model’s explanatory variables) and calls for further investigation. When the VIF exceeds 10 or the tolerance is below 0.1, there is significant multicollinearity that must be corrected. In this case no multicollinearity is observed.
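
As a convenience, these thresholds can be checked programmatically (a sketch; here it returns an empty set):

# Flag any predictor whose VIF exceeds the usual warning threshold of 4
vif_vals <- vif(mejor_modelo)
vif_vals[vif_vals > 4]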

8.3.2. Linearity of the residuals

mean(mejor_modelo$residuals)
## [1] -1.419695e-17
# plotting option 1
plot(mejor_modelo, 1)

# plotting option 2, which shows the same thing
autoplot(mejor_modelo, 1)

In the Residuals vs. Fitted plot we see that the mean of the residuals is close to zero (although not uniformly so), so the model’s linearity is not violated in principle. However, since the dependent variable “quality” is discrete, an ordinary linear regression model does not really fit our data.

8.3.3. Normality of the residuals

First we check the normality of the residuals. Since the Shapiro-Wilk test only accepts up to 5000 samples of the residuals, we also use the Anderson-Darling test to compare results.

Shapiro-Wilk:

# muestras_residuos = resid(mejor_modelo)
# obtain the distribution of the studentized residuals
muestras_residuos1 = studres(mejor_modelo)
residual_norm = shapiro.test(muestras_residuos1[0:5000])
residual_norm
## 
##  Shapiro-Wilk normality test
## 
## data:  muestras_residuos1[0:5000]
## W = 0.98969, p-value = 7.913e-08

Anderson-Darling:

# install.packages('nortest')
residual_anderson = ad.test(muestras_residuos1)
residual_anderson
## 
##  Anderson-Darling normality test
## 
## data:  muestras_residuos1
## A = 4.0651, p-value = 4.078e-10

This normality assumption can also be checked graphically; as the plot shows, our data deviate from the main line in the tails, which may indicate that the residuals do not follow a normal distribution.

# These three plots show the same thing
plot(mejor_modelo, 2)

autoplot(mejor_modelo, 2)

hist(muestras_residuos1, freq = FALSE, main = "Distribution of the studentized residuals")
xfit <- seq(min(muestras_residuos1), max(muestras_residuos1),
    length = 40)
yfit <- dnorm(xfit)
lines(xfit, yfit)

The Q-Q plot shows that the residuals follow, or at least approximate, a normal distribution in the central range, even though both tests above reject strict normality. It can therefore be assumed, with some caution, that the coefficient estimators are approximately normally distributed.

8.3.4. Homoscedasticity

We check for homoscedasticity (that the residuals have constant variance).

As the results show, the p-value < 0.05, so the null hypothesis is rejected: the variance is not constant for this linear regression model (there is heteroscedasticity, which is a problem). We can conclude that this mathematical model is not adequate.

# https://fhernanb.github.io/libro_regresion/homo.html
# another test for checking homoscedasticity
ncvTest(mejor_modelo)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 17.5856, Df = 1, p = 2.7466e-05
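
As a cross-check, the Breusch-Pagan test from the already loaded lmtest package tests the same null hypothesis of constant variance (a sketch; its verdict should agree with ncvTest here):

# Breusch-Pagan test: H0 = homoscedastic residuals
bptest(mejor_modelo)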

Homoscedasticity can also be checked graphically; ideally the red line should be as straight/horizontal as possible.

plot(mejor_modelo, 3)

autoplot(mejor_modelo, 3)

8.3.5. Independence of the residuals

As the results show, the p-value > 0.05, so we fail to reject the null hypothesis that the residuals are independent.

dwtest(mejor_modelo)
## 
##  Durbin-Watson test
## 
## data:  mejor_modelo
## DW = 2.0088, p-value = 0.563
## alternative hypothesis: true autocorrelation is greater than 0

The independence of the residuals can also be checked graphically; since no unusual patterns are visible, this suggests that the residuals are independent and show no autocorrelation.

plot(mejor_modelo$resid)

acf(mejor_modelo$residuals)

ML1 - MACHINE LEARNING 1

9. Dimensionality reduction techniques

9.1. PCA - Principal Component Analysis

For principal component analysis we take all the variables in our dataset except “quality”, which is the one we want to predict.

prcomp_train <- prcomp(train[, -6])
prcomp_train
## Standard deviations (1, .., p=11):
##  [1] 1.7932919537 1.1139049116 0.8715249896 0.3582166080 0.3099346034
##  [6] 0.3016065686 0.2104320866 0.1621592924 0.1042111183 0.0987024931
## [11] 0.0007134006
## 
## Rotation (n x k) = (11 x 11):
##                                   PC1           PC2           PC3          PC4
## fixed_acidity             0.987707565 -0.0103012861 -1.022231e-01  0.062459844
## volatile_acidity         -0.025573618 -0.0335218329  2.264482e-02 -0.079018774
## citric_acid               0.074203260  0.0250265448 -3.742426e-02 -0.046522512
## density                   0.000724006 -0.0007910339  8.763104e-05 -0.001630274
## alcohol                  -0.033238009  0.9069893906 -3.959749e-01 -0.033337688
## pH                       -0.052371326  0.0213916196  4.913437e-03  0.011180103
## Log_residual_sugar        0.039855973 -0.0021044752 -1.037845e-01 -0.826895669
## Log_chlorides             0.031702071 -0.0848605722  3.174456e-02 -0.391082136
## Log_free_sulfur_dioxide  -0.086908780 -0.2361937132 -6.575105e-01  0.323215453
## Log_total_sulfur_dioxide -0.061636505 -0.3343377688 -6.206551e-01 -0.211995173
## Log_sulphates             0.021906011  0.0167601084 -3.879522e-02 -0.018426525
##                                    PC5           PC6          PC7          PC8
## fixed_acidity            -0.0230198704 -0.0059651163 -0.064939449 -0.039228108
## volatile_acidity         -0.0390618557 -0.0303160677 -0.597211835 -0.541189128
## citric_acid               0.0729849365  0.0909937418  0.381491021  0.409952831
## density                  -0.0002677616 -0.0007711058  0.000185434 -0.001576883
## alcohol                   0.0565568851  0.0881221105 -0.082562820 -0.001514142
## pH                       -0.0569108452 -0.0518242796 -0.038186548 -0.140386718
## Log_residual_sugar       -0.2808810688 -0.4555153363  0.125710162 -0.007655261
## Log_chlorides             0.8611890499  0.1118830637 -0.226520421  0.140520863
## Log_free_sulfur_dioxide   0.2215346137 -0.5890496504 -0.021993993  0.034613521
## Log_total_sulfur_dioxide -0.1995578676  0.6396298085 -0.031382301 -0.022315853
## Log_sulphates             0.2770165639  0.0647750830  0.645544341 -0.704485728
##                                   PC9          PC10          PC11
## fixed_acidity             0.041080555 -0.0443511620  0.0008574666
## volatile_acidity         -0.081317870  0.5769403933  0.0004106125
## citric_acid               0.118259173  0.8056760674 -0.0002945593
## density                   0.004180642 -0.0002855141 -0.9999877190
## alcohol                  -0.015362480 -0.0183878205 -0.0008768774
## pH                        0.983885416 -0.0388597280  0.0043212789
## Log_residual_sugar       -0.025609909 -0.0311525353  0.0017331842
## Log_chlorides             0.071354850 -0.0867580300  0.0004730618
## Log_free_sulfur_dioxide  -0.012248313  0.0445217874 -0.0001883117
## Log_total_sulfur_dioxide  0.025005306 -0.0566273254  0.0002213595
## Log_sulphates            -0.053924917  0.0228971577  0.0009037505

The standard deviations are the square roots of the eigenvalues of the covariance matrix, and they represent the variability captured by each component. The larger the value, the more relevant the corresponding component. To visualize the relative importance of each component:

plot(prcomp_train)
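
As a quick sanity check (a sketch, relying on the fact that this first PCA was run on the unscaled data), the squared standard deviations reported by prcomp should match the eigenvalues of the covariance matrix:

# TRUE (within tolerance) if prcomp's sdev^2 equal the covariance eigenvalues
all.equal(prcomp_train$sdev^2, eigen(cov(train[, -6]))$values, tolerance = 1e-06)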

Numerically:

summary(prcomp_train)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.7933 1.1139 0.8715 0.35822 0.30993 0.30161 0.21043
## Proportion of Variance 0.5719 0.2207 0.1351 0.02282 0.01708 0.01618 0.00788
## Cumulative Proportion  0.5719 0.7926 0.9277 0.95052 0.96761 0.98378 0.99166
##                            PC8     PC9    PC10      PC11
## Standard deviation     0.16216 0.10421 0.09870 0.0007134
## Proportion of Variance 0.00468 0.00193 0.00173 0.0000000
## Cumulative Proportion  0.99634 0.99827 1.00000 1.0000000

To prevent a variable from being more relevant and more influential simply because of its larger magnitude, we standardize the data:

prcomp_train <- prcomp(train[, -6], center = TRUE, scale. = TRUE)
prcomp_train
## Standard deviations (1, .., p=11):
##  [1] 1.7629487 1.4397808 1.2644531 1.0430903 0.9814598 0.8259786 0.7584904
##  [8] 0.6527267 0.5072574 0.4102052 0.2443802
## 
## Rotation (n x k) = (11 x 11):
##                                   PC1           PC2         PC3         PC4
## fixed_acidity             0.497949147 -0.0774604042  0.07773493 -0.12761147
## volatile_acidity         -0.228613756  0.2973557116  0.42725833 -0.12854199
## citric_acid               0.453908847 -0.1799547686 -0.23564528 -0.04506366
## density                   0.414489403  0.2687629961  0.25839356 -0.17865109
## alcohol                  -0.094708706 -0.4167029020 -0.37845423 -0.33087315
## pH                       -0.409475084 -0.0009806387 -0.03806645 -0.15945585
## Log_residual_sugar        0.198616078  0.2082892990 -0.02938800 -0.70534037
## Log_chlorides             0.227309032  0.2157308428  0.20635906  0.40324174
## Log_free_sulfur_dioxide  -0.074197539  0.4790706213 -0.48908782  0.01812011
## Log_total_sulfur_dioxide -0.004662316  0.5505085334 -0.39476662  0.01314934
## Log_sulphates             0.220597718 -0.0694794605 -0.32548381  0.37112255
##                                  PC5         PC6         PC7        PC8
## fixed_acidity             0.20602872 -0.01298276  0.32851510 -0.2897369
## volatile_acidity         -0.17298151 -0.29204526  0.60308842 -0.1819769
## citric_acid               0.08260043 -0.05040931 -0.15768865 -0.3605598
## density                  -0.04472090  0.41364082  0.07930408 -0.1990981
## alcohol                  -0.26306784 -0.40477572  0.21105461 -0.2724748
## pH                       -0.33063659  0.52406092 -0.18375767 -0.5565766
## Log_residual_sugar       -0.37735390 -0.05607819 -0.22622011  0.4088799
## Log_chlorides            -0.50419008 -0.43132228 -0.39929144 -0.2507839
## Log_free_sulfur_dioxide   0.11152021 -0.07612622  0.07572221 -0.1353631
## Log_total_sulfur_dioxide  0.14453935 -0.07108447  0.02253235 -0.0743601
## Log_sulphates            -0.55709025  0.31997827  0.44949490  0.2744915
##                                  PC9         PC10         PC11
## fixed_acidity             0.31604555  0.126404238 -0.611040342
## volatile_acidity         -0.32007451 -0.213814616  0.005828197
## citric_acid              -0.61379756 -0.394018864  0.088273875
## density                   0.18509322  0.158370103  0.615559165
## alcohol                   0.19703270  0.271013852  0.317142337
## pH                       -0.02868781 -0.001296862 -0.277713094
## Log_residual_sugar       -0.04278776 -0.103130711 -0.205992070
## Log_chlorides             0.15996348  0.070066978 -0.059433915
## Log_free_sulfur_dioxide   0.43001182 -0.541972628  0.067211285
## Log_total_sulfur_dioxide -0.35634449  0.611908258 -0.086661149
## Log_sulphates            -0.08777829 -0.028491471 -0.064760413

Again numerically:

summary(prcomp_train)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6    PC7
## Standard deviation     1.7629 1.4398 1.2645 1.04309 0.98146 0.82598 0.7585
## Proportion of Variance 0.2825 0.1885 0.1454 0.09891 0.08757 0.06202 0.0523
## Cumulative Proportion  0.2825 0.4710 0.6163 0.71526 0.80283 0.86485 0.9172
##                            PC8     PC9   PC10    PC11
## Standard deviation     0.65273 0.50726 0.4102 0.24438
## Proportion of Variance 0.03873 0.02339 0.0153 0.00543
## Cumulative Proportion  0.95588 0.97927 0.9946 1.00000
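
A quick sanity check (a sketch): after standardization every variable contributes unit variance, so the component variances must sum to the number of variables:

# eigenvalues of a correlation-based PCA sum to the number of variables (11)
sum(prcomp_train$sdev^2)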

We examine the variances and the components more graphically:

prcomp_train.var <- prcomp_train$sdev^2
prcomp_train.pvar <- prcomp_train.var/sum(prcomp_train.var)
plot(cumsum(prcomp_train.pvar), xlab = "components", ylab = "cumulative variance",
    ylim = c(0, 1), type = "b")
grid()
abline(h = 0.95, col = "blue")

plot(prcomp_train, type = "l", main = "Variance explained by PCA")

fviz_screeplot(prcomp_train, addlabels = TRUE)

As we can see, the first two components (PC1 and PC2) capture only 47.10% of the variability. With the first three (PC1, PC2 and PC3) the figure rises to 61.63%. This means that a plot of the wine data represented by the first two or three principal components will not be sufficiently representative. The plot of components versus cumulative variance also shows that the first 8 PCs are needed to cover 95% of the variance of the dataset. It is hard to see the point of reducing just 11 variables to 8 PCs, given the loss of interpretability of the original variables that this implies.
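
That threshold can also be located programmatically; a minimal sketch using the proportions computed above:

# smallest number of components whose cumulative variance reaches 95%
min(which(cumsum(prcomp_train.pvar) >= 0.95))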

9.1.1. PCA analysis with 2 components (PC1 and PC2)

We plot the data projected onto the first two components:

ggplot(as.data.frame(prcomp(train[, -6], scale = T)$x[, 1:2]),
    aes(x = PC1, y = PC2, label = rownames(train))) + geom_point() +
    geom_text(hjust = 0, vjust = 0)

We try to incorporate the information about the variables using the “biplot” technique:

ggbiplot(prcomp(train[, -6], scale = T), labels = rownames(train))

  • Treating the variable “quality” as numeric (a continuous variable):
ggbiplot(prcomp(train[, -6], scale = T), ellipse = TRUE, labels = rownames(train),
    groups = train$quality)

  • Treating the variable “quality” as categorical (a discrete variable):
train_fquality <- train %>%
    mutate(quality = as.factor(quality))

ggbiplot(prcomp_train, obs.scale = 1, var.scale = 1, groups = train_fquality$quality,
    ellipse = TRUE, circle = TRUE) + scale_color_discrete(name = "") +
    theme(legend.direction = "horizontal", legend.position = "top")

We see that the analysis with only 2 components is not optimal, since on their own they do not explain a high percentage of the variance. Even so, as an exploratory view of the data and of the possible groups, some pattern can be discerned: broadly, the lower a point sits in the plot, the better the wine's rating (blue and pink points are scores close to 7 and 8), and the higher it sits, the worse the rating (green, yellow and red points are scores of 5 and below).

9.1.2. PCA analysis with 4 components (PC1, PC2, PC3 and PC4)

We extend the previous analysis to the first 4 principal components, trying to identify clearer groupings in the data.

colores <- function(vec) {
    # rainbow() returns a vector with one distinct colour
    # per unique value of vec
    col <- rainbow(length(unique(vec)))
    return(col[as.numeric(as.factor(vec))])
}

par(mfrow = c(1, 2))
# Observations on PC1 and PC2
plot(prcomp_train$x[, 1:2], col = colores(train_fquality$quality),
    pch = 19, xlab = "PC1", ylab = "PC2")
# Observations on PC1 and PC3
plot(prcomp_train$x[, c(1, 3)], col = colores(train_fquality$quality),
    pch = 19, xlab = "PC1", ylab = "PC3")

# Observations on PC1 and PC4
plot(prcomp_train$x[, c(1, 4)], col = colores(train_fquality$quality),
    pch = 19, xlab = "PC1", ylab = "PC4")

Using more components (extending the analysis to the third and fourth PCs) adds very little, and we see no clear or noteworthy groupings. This is because even with the 4 dimensions of the first 4 PCs we would barely explain 71.52% of the variance in the data.

9.1.3. PCA analysis with a binarized categorical variable

In our training dataset we create the binary variable “nota_vino”, which, based on “quality”, flags wines as passing (quality >= 6, labelled “aprobado”) or failing (quality < 6, labelled “suspenso”).

train_pca <- train[, colnames(train) != "quality"]
train_pca$nota_vino <- factor(train$quality < 6, labels = c("aprobado",
    "suspenso"))  # levels = c('FALSE', 'TRUE')
str(train_pca)
## tibble [1,279 x 12] (S3: tbl_df/tbl/data.frame)
##  $ fixed_acidity           : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
##  $ volatile_acidity        : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
##  $ citric_acid             : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
##  $ density                 : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
##  $ alcohol                 : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
##  $ pH                      : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
##  $ Log_residual_sugar      : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
##  $ Log_chlorides           : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
##  $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
##  $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
##  $ Log_sulphates           : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
##  $ nota_vino               : Factor w/ 2 levels "aprobado","suspenso": 2 2 2 2 2 2 2 2 1 1 ...
table(train_pca$nota_vino)
## 
## aprobado suspenso 
##      682      597
train_pca
## # A tibble: 1,279 x 12
##    fixed_acidity volatile_acidity citric_acid density alcohol    pH
##            <dbl>            <dbl>       <dbl>   <dbl>   <dbl> <dbl>
##  1           7.1            0.48         0.28   0.997    10.3  3.24
##  2           7.6            0.49         0.33   0.997     9    3.3 
##  3           5              1.02         0.04   0.994    10.5  3.75
##  4           7.6            0.43         0.29   0.997     9.5  3.4 
##  5           6.8            0.59         0.1    0.996     9.7  3.3 
##  6           6.8            0.815        0      0.995     9.8  3.3 
##  7           8.5            0.21         0.52   0.996    10.4  3.36
##  8           7.4            0.36         0.29   0.996    11    3.3 
##  9           5.5            0.49         0.03   0.991    14    3.3 
## 10           6.8            0.49         0.22   0.994    11.3  3.41
## # ... with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## #   Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## #   Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <fct>

We now carry out the principal component analysis just as before:

prcomp_train_2 <- prcomp(train_pca[, -12], center = TRUE, scale. = TRUE)
prcomp_train_2
## Standard deviations (1, .., p=11):
##  [1] 1.7629487 1.4397808 1.2644531 1.0430903 0.9814598 0.8259786 0.7584904
##  [8] 0.6527267 0.5072574 0.4102052 0.2443802
## 
## Rotation (n x k) = (11 x 11):
##                                   PC1           PC2         PC3         PC4
## fixed_acidity             0.497949147 -0.0774604042  0.07773493 -0.12761147
## volatile_acidity         -0.228613756  0.2973557116  0.42725833 -0.12854199
## citric_acid               0.453908847 -0.1799547686 -0.23564528 -0.04506366
## density                   0.414489403  0.2687629961  0.25839356 -0.17865109
## alcohol                  -0.094708706 -0.4167029020 -0.37845423 -0.33087315
## pH                       -0.409475084 -0.0009806387 -0.03806645 -0.15945585
## Log_residual_sugar        0.198616078  0.2082892990 -0.02938800 -0.70534037
## Log_chlorides             0.227309032  0.2157308428  0.20635906  0.40324174
## Log_free_sulfur_dioxide  -0.074197539  0.4790706213 -0.48908782  0.01812011
## Log_total_sulfur_dioxide -0.004662316  0.5505085334 -0.39476662  0.01314934
## Log_sulphates             0.220597718 -0.0694794605 -0.32548381  0.37112255
##                                  PC5         PC6         PC7        PC8
## fixed_acidity             0.20602872 -0.01298276  0.32851510 -0.2897369
## volatile_acidity         -0.17298151 -0.29204526  0.60308842 -0.1819769
## citric_acid               0.08260043 -0.05040931 -0.15768865 -0.3605598
## density                  -0.04472090  0.41364082  0.07930408 -0.1990981
## alcohol                  -0.26306784 -0.40477572  0.21105461 -0.2724748
## pH                       -0.33063659  0.52406092 -0.18375767 -0.5565766
## Log_residual_sugar       -0.37735390 -0.05607819 -0.22622011  0.4088799
## Log_chlorides            -0.50419008 -0.43132228 -0.39929144 -0.2507839
## Log_free_sulfur_dioxide   0.11152021 -0.07612622  0.07572221 -0.1353631
## Log_total_sulfur_dioxide  0.14453935 -0.07108447  0.02253235 -0.0743601
## Log_sulphates            -0.55709025  0.31997827  0.44949490  0.2744915
##                                  PC9         PC10         PC11
## fixed_acidity             0.31604555  0.126404238 -0.611040342
## volatile_acidity         -0.32007451 -0.213814616  0.005828197
## citric_acid              -0.61379756 -0.394018864  0.088273875
## density                   0.18509322  0.158370103  0.615559165
## alcohol                   0.19703270  0.271013852  0.317142337
## pH                       -0.02868781 -0.001296862 -0.277713094
## Log_residual_sugar       -0.04278776 -0.103130711 -0.205992070
## Log_chlorides             0.15996348  0.070066978 -0.059433915
## Log_free_sulfur_dioxide   0.43001182 -0.541972628  0.067211285
## Log_total_sulfur_dioxide -0.35634449  0.611908258 -0.086661149
## Log_sulphates            -0.08777829 -0.028491471 -0.064760413
summary(prcomp_train_2)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6    PC7
## Standard deviation     1.7629 1.4398 1.2645 1.04309 0.98146 0.82598 0.7585
## Proportion of Variance 0.2825 0.1885 0.1454 0.09891 0.08757 0.06202 0.0523
## Cumulative Proportion  0.2825 0.4710 0.6163 0.71526 0.80283 0.86485 0.9172
##                            PC8     PC9   PC10    PC11
## Standard deviation     0.65273 0.50726 0.4102 0.24438
## Proportion of Variance 0.03873 0.02339 0.0153 0.00543
## Cumulative Proportion  0.95588 0.97927 0.9946 1.00000
ggbiplot(prcomp_train_2, obs.scale = 1, var.scale = 1, groups = train_pca$nota_vino,
    ellipse = TRUE, circle = TRUE) + scale_color_discrete(name = "") +
    theme(legend.direction = "horizontal", legend.position = "top")

We see that the results are identical, with no improvement. However, this way of displaying the data gives a clearer view of the pattern mentioned earlier: points lower in the plot generally correspond to “aprobado” wines (pinkish points, scores of 6 or above) and points higher up generally to “suspenso” wines (bluish points, scores below 6). Beyond that, and with only 47.10% of the variance explained by the first 2 PCs, no further patterns or conclusions can be drawn from the data.

9.2. t-SNE - t-distributed stochastic neighbor embedding

We now attempt a dimensionality reduction using methods that, unlike PCA, are not linear. The t-SNE algorithm can separate data that are not linearly separable, i.e. that cannot be split by a straight line, so it may allow us to work with data that a linear method cannot handle. It can help us understand high-dimensional data by projecting it down to only 2 or 3 dimensions.

tsne_train <- train[, -6]
tsne_train
## # A tibble: 1,279 x 11
##    fixed_acidity volatile_acidity citric_acid density alcohol    pH
##            <dbl>            <dbl>       <dbl>   <dbl>   <dbl> <dbl>
##  1           7.1            0.48         0.28   0.997    10.3  3.24
##  2           7.6            0.49         0.33   0.997     9    3.3 
##  3           5              1.02         0.04   0.994    10.5  3.75
##  4           7.6            0.43         0.29   0.997     9.5  3.4 
##  5           6.8            0.59         0.1    0.996     9.7  3.3 
##  6           6.8            0.815        0      0.995     9.8  3.3 
##  7           8.5            0.21         0.52   0.996    10.4  3.36
##  8           7.4            0.36         0.29   0.996    11    3.3 
##  9           5.5            0.49         0.03   0.991    14    3.3 
## 10           6.8            0.49         0.22   0.994    11.3  3.41
## # ... with 1,269 more rows, and 5 more variables: Log_residual_sugar <dbl>,
## #   Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## #   Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>

The algorithm builds a probability distribution that represents the similarities between neighbours, and uses it to group them while reducing the dimension.

set.seed(3)
tsne_data <- tsne_train[, 1:11]

tsne <- Rtsne(tsne_data, check_duplicates = FALSE, perplexity = 30,
    pca = FALSE, theta = 0.5, dims = 2, max_iter = 500, eta = 200,
    epoch = 1000)

par(mfrow = c(1, 2))
plot(tsne$Y, col = "black", bg = train_fquality$quality, pch = 21,
    cex = 1.5, main = "tSNE", xlab = "tSNE dimension 1", ylab = "tSNE dimension 2")

As we can see, the results, as with PCA, are not satisfactory, so applying these techniques to our dataset is not advisable.

9.2.1. t-SNE analysis with a binarized categorical variable

In our training dataset we create the binary variable “nota_vino”, which, based on “quality”, flags wines as passing (quality >= 6, labelled “aprobado”) or failing (quality < 6, labelled “suspenso”).

train_tsne <- train[, colnames(train) != "quality"]
train_tsne$nota_vino <- factor(train$quality < 6, labels = c("aprobado",
    "suspenso"))  # levels = c('FALSE', 'TRUE')
str(train_tsne)
## tibble [1,279 x 12] (S3: tbl_df/tbl/data.frame)
##  $ fixed_acidity           : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
##  $ volatile_acidity        : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
##  $ citric_acid             : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
##  $ density                 : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
##  $ alcohol                 : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
##  $ pH                      : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
##  $ Log_residual_sugar      : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
##  $ Log_chlorides           : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
##  $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
##  $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
##  $ Log_sulphates           : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
##  $ nota_vino               : Factor w/ 2 levels "aprobado","suspenso": 2 2 2 2 2 2 2 2 1 1 ...
table(train_tsne$nota_vino)
## 
## aprobado suspenso 
##      682      597
train_tsne
## # A tibble: 1,279 x 12
##    fixed_acidity volatile_acidity citric_acid density alcohol    pH
##            <dbl>            <dbl>       <dbl>   <dbl>   <dbl> <dbl>
##  1           7.1            0.48         0.28   0.997    10.3  3.24
##  2           7.6            0.49         0.33   0.997     9    3.3 
##  3           5              1.02         0.04   0.994    10.5  3.75
##  4           7.6            0.43         0.29   0.997     9.5  3.4 
##  5           6.8            0.59         0.1    0.996     9.7  3.3 
##  6           6.8            0.815        0      0.995     9.8  3.3 
##  7           8.5            0.21         0.52   0.996    10.4  3.36
##  8           7.4            0.36         0.29   0.996    11    3.3 
##  9           5.5            0.49         0.03   0.991    14    3.3 
## 10           6.8            0.49         0.22   0.994    11.3  3.41
## # ... with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## #   Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## #   Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <fct>
set.seed(3)
tsne_data_2 <- train_tsne[, 1:11]

tsne_2 <- Rtsne(tsne_data_2, check_duplicates = FALSE, perplexity = 30,
    pca = FALSE, theta = 0.5, dims = 2, max_iter = 500, eta = 200,
    epoch = 1000)

par(mfrow = c(1, 2))
plot(tsne_2$Y, col = "black", bg = train_tsne$nota_vino, pch = 21,
    cex = 1.5, main = "tSNE", xlab = "tSNE dimension 1", ylab = "tSNE dimension 2")

Binarizing the response variable does not clarify much either; a useful dimensionality reduction does not seem feasible on our data.

10. Supervised learning

10.1. GLM - Generalized Linear Model

10.1.1. Creating the binary variable and analysing relationships between variables

First of all, we create the binary variable “nota_vino”, which, based on “quality”, flags wines as passing (quality >= 6, coded “1”) or failing (quality < 6, coded “0”).

train_glm <- train %>%
    mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
    mutate(quality = NULL)
train_glm
## # A tibble: 1,279 x 12
##    fixed_acidity volatile_acidity citric_acid density alcohol    pH
##            <dbl>            <dbl>       <dbl>   <dbl>   <dbl> <dbl>
##  1           7.1            0.48         0.28   0.997    10.3  3.24
##  2           7.6            0.49         0.33   0.997     9    3.3 
##  3           5              1.02         0.04   0.994    10.5  3.75
##  4           7.6            0.43         0.29   0.997     9.5  3.4 
##  5           6.8            0.59         0.1    0.996     9.7  3.3 
##  6           6.8            0.815        0      0.995     9.8  3.3 
##  7           8.5            0.21         0.52   0.996    10.4  3.36
##  8           7.4            0.36         0.29   0.996    11    3.3 
##  9           5.5            0.49         0.03   0.991    14    3.3 
## 10           6.8            0.49         0.22   0.994    11.3  3.41
## # ... with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## #   Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## #   Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_glm$nota_vino)
## 
##   0   1 
## 597 682
str(train_glm)
## tibble [1,279 x 12] (S3: tbl_df/tbl/data.frame)
##  $ fixed_acidity           : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
##  $ volatile_acidity        : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
##  $ citric_acid             : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
##  $ density                 : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
##  $ alcohol                 : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
##  $ pH                      : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
##  $ Log_residual_sugar      : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
##  $ Log_chlorides           : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
##  $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
##  $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
##  $ Log_sulphates           : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
##  $ nota_vino               : num [1:1279] 0 0 0 0 0 0 0 0 1 1 ...

With this distinction between “passing” and “failing” wines, the distribution between the two groups is fairly balanced: 597 fails and 682 passes in the training data.
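
A quick check of that balance as proportions (a minimal sketch):

# class proportions: roughly 53% passing ("1") vs 47% failing ("0")
round(prop.table(table(train_glm$nota_vino)), 3)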

Next, we look at the correlations and the behaviour of the variables with respect to this newly created categorical variable:

c <- cor(train_glm)
corrplot(c)

We show the correlations numerically:

round(c, 2)
##                          fixed_acidity volatile_acidity citric_acid density
## fixed_acidity                     1.00            -0.26        0.68    0.68
## volatile_acidity                 -0.26             1.00       -0.55    0.02
## citric_acid                       0.68            -0.55        1.00    0.37
## density                           0.68             0.02        0.37    1.00
## alcohol                          -0.05            -0.21        0.15   -0.49
## pH                               -0.64             0.22       -0.49   -0.32
## Log_residual_sugar                0.20             0.04        0.19    0.44
## Log_chlorides                     0.16             0.09        0.16    0.33
## Log_free_sulfur_dioxide          -0.18             0.03       -0.11   -0.04
## Log_total_sulfur_dioxide         -0.12             0.08       -0.03    0.11
## Log_sulphates                     0.19            -0.30        0.32    0.14
## nota_vino                         0.11            -0.32        0.18   -0.15
##                          alcohol    pH Log_residual_sugar Log_chlorides
## fixed_acidity              -0.05 -0.64               0.20          0.16
## volatile_acidity           -0.21  0.22               0.04          0.09
## citric_acid                 0.15 -0.49               0.19          0.16
## density                    -0.49 -0.32               0.44          0.33
## alcohol                     1.00  0.18               0.06         -0.29
## pH                          0.18  1.00              -0.10         -0.26
## Log_residual_sugar          0.06 -0.10               1.00          0.12
## Log_chlorides              -0.29 -0.26               0.12          1.00
## Log_free_sulfur_dioxide    -0.09  0.08               0.10         -0.02
## Log_total_sulfur_dioxide   -0.24 -0.03               0.17          0.06
## Log_sulphates               0.13 -0.14               0.02          0.22
## nota_vino                   0.45 -0.01               0.00         -0.14
##                          Log_free_sulfur_dioxide Log_total_sulfur_dioxide
## fixed_acidity                              -0.18                    -0.12
## volatile_acidity                            0.03                     0.08
## citric_acid                                -0.11                    -0.03
## density                                    -0.04                     0.11
## alcohol                                    -0.09                    -0.24
## pH                                          0.08                    -0.03
## Log_residual_sugar                          0.10                     0.17
## Log_chlorides                              -0.02                     0.06
## Log_free_sulfur_dioxide                     1.00                     0.79
## Log_total_sulfur_dioxide                    0.79                     1.00
## Log_sulphates                               0.06                     0.04
## nota_vino                                  -0.06                    -0.20
##                          Log_sulphates nota_vino
## fixed_acidity                     0.19      0.11
## volatile_acidity                 -0.30     -0.32
## citric_acid                       0.32      0.18
## density                           0.14     -0.15
## alcohol                           0.13      0.45
## pH                               -0.14     -0.01
## Log_residual_sugar                0.02      0.00
## Log_chlorides                     0.22     -0.14
## Log_free_sulfur_dioxide           0.06     -0.06
## Log_total_sulfur_dioxide          0.04     -0.20
## Log_sulphates                     1.00      0.28
## nota_vino                         0.28      1.00

We analyse the variables bivariately against the new class:

# nota_vino vs alcohol
train_glm %>%
    ggplot(aes(x = alcohol, fill = factor(nota_vino))) + geom_density(alpha = 0.5)

# nota_vino vs Log_sulphates
train_glm %>%
    ggplot(aes(x = Log_sulphates, fill = factor(nota_vino))) +
    geom_density(alpha = 0.5)

# nota_vino vs volatile_acidity
train_glm %>%
    ggplot(aes(x = volatile_acidity, fill = factor(nota_vino))) +
    geom_density(alpha = 0.5)

# nota_vino vs density
train_glm %>%
    ggplot(aes(x = density, fill = factor(nota_vino))) + geom_density(alpha = 0.5)

# nota_vino vs citric_acid
train_glm %>%
    ggplot(aes(x = citric_acid, fill = factor(nota_vino))) +
    geom_density(alpha = 0.5)

# nota_vino vs Log_total_sulfur_dioxide
train_glm %>%
    ggplot(aes(x = Log_total_sulfur_dioxide, fill = factor(nota_vino))) +
    geom_density(alpha = 0.5)

Broadly speaking, the wines in the passing category have a higher value of “alcohol”, a slightly higher “Log_sulphates”, lower “volatile_acidity”, slightly lower “density”, higher “citric_acid” and a lower value of “Log_total_sulfur_dioxide”.

# nota_vino vs fixed_acidity
train_glm %>%
    ggplot(aes(x = fixed_acidity, fill = factor(nota_vino))) +
    geom_density(alpha = 0.5)

# nota_vino vs Log_free_sulfur_dioxide
train_glm %>%
    ggplot(aes(x = Log_free_sulfur_dioxide, fill = factor(nota_vino))) +
    geom_density(alpha = 0.5)

# nota_vino vs Log_residual_sugar
train_glm %>%
    ggplot(aes(x = Log_residual_sugar, fill = factor(nota_vino))) +
    geom_density(alpha = 0.5)

# nota_vino vs pH
train_glm %>%
    ggplot(aes(x = pH, fill = factor(nota_vino))) + geom_density(alpha = 0.5)

# nota_vino vs Log_chlorides
train_glm %>%
    ggplot(aes(x = Log_chlorides, fill = factor(nota_vino))) +
    geom_density(alpha = 0.5)

For the variables “Log_chlorides”, “pH”, “Log_residual_sugar”, “Log_free_sulfur_dioxide” and “fixed_acidity”, it is harder to distinguish passing from failing wines in the density plots, since these characteristics are not decisive for either group.

10.1.2. Building the logistic regression model

We build a logistic regression model on the variables of our dataset to serve as a predictor of the binary variable we created.

modelo_glm = glm(nota_vino ~ alcohol + fixed_acidity + volatile_acidity +
    citric_acid + Log_chlorides + Log_total_sulfur_dioxide +
    Log_free_sulfur_dioxide + density + pH + Log_sulphates, data = train_glm,
    family = binomial)

modelo_glm
## 
## Call:  glm(formula = nota_vino ~ alcohol + fixed_acidity + volatile_acidity + 
##     citric_acid + Log_chlorides + Log_total_sulfur_dioxide + 
##     Log_free_sulfur_dioxide + density + pH + Log_sulphates, family = binomial, 
##     data = train_glm)
## 
## Coefficients:
##              (Intercept)                   alcohol             fixed_acidity  
##                 -24.7457                    0.9700                    0.1753  
##         volatile_acidity               citric_acid             Log_chlorides  
##                  -3.2462                   -1.7627                   -0.4797  
## Log_total_sulfur_dioxide   Log_free_sulfur_dioxide                   density  
##                  -0.6740                    0.4413                   17.5474  
##                       pH             Log_sulphates  
##                  -0.1818                    2.6310  
## 
## Degrees of Freedom: 1278 Total (i.e. Null);  1268 Residual
## Null Deviance:       1767 
## Residual Deviance: 1300  AIC: 1322
summary(modelo_glm)
## 
## Call:
## glm(formula = nota_vino ~ alcohol + fixed_acidity + volatile_acidity + 
##     citric_acid + Log_chlorides + Log_total_sulfur_dioxide + 
##     Log_free_sulfur_dioxide + density + pH + Log_sulphates, family = binomial, 
##     data = train_glm)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3178  -0.8232   0.3053   0.7861   2.4429  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -24.74575   73.63188  -0.336  0.73682    
## alcohol                    0.96996    0.10386   9.340  < 2e-16 ***
## fixed_acidity              0.17526    0.09475   1.850  0.06436 .  
## volatile_acidity          -3.24621    0.54263  -5.982 2.20e-09 ***
## citric_acid               -1.76266    0.60585  -2.909  0.00362 ** 
## Log_chlorides             -0.47972    0.24746  -1.939  0.05256 .  
## Log_total_sulfur_dioxide  -0.67404    0.17254  -3.907 9.36e-05 ***
## Log_free_sulfur_dioxide    0.44134    0.16690   2.644  0.00818 ** 
## density                   17.54736   74.95795   0.234  0.81491    
## pH                        -0.18179    0.73995  -0.246  0.80593    
## Log_sulphates              2.63103    0.39252   6.703 2.04e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1767.4  on 1278  degrees of freedom
## Residual deviance: 1300.2  on 1268  degrees of freedom
## AIC: 1322.2
## 
## Number of Fisher Scoring iterations: 4

As we can see, we keep only the significant variables that really affect “nota_vino”, and we build a new model exclusively with them (“Log_sulphates”, “Log_total_sulfur_dioxide”, “volatile_acidity” and “alcohol”). This simplifies the model, keeps only the variables that genuinely matter for prediction, and yields the best logistic regression model we can build for our dataset.

modelo_glm2 = glm(nota_vino ~ alcohol + volatile_acidity + Log_sulphates +
    Log_total_sulfur_dioxide, data = train_glm, family = binomial)

modelo_glm2
## 
## Call:  glm(formula = nota_vino ~ alcohol + volatile_acidity + Log_sulphates + 
##     Log_total_sulfur_dioxide, family = binomial, data = train_glm)
## 
## Coefficients:
##              (Intercept)                   alcohol          volatile_acidity  
##                  -5.9714                    0.9694                   -2.7563  
##            Log_sulphates  Log_total_sulfur_dioxide  
##                   2.2830                   -0.3912  
## 
## Degrees of Freedom: 1278 Total (i.e. Null);  1274 Residual
## Null Deviance:       1767 
## Residual Deviance: 1329  AIC: 1339
summary(modelo_glm2)
## 
## Call:
## glm(formula = nota_vino ~ alcohol + volatile_acidity + Log_sulphates + 
##     Log_total_sulfur_dioxide, family = binomial, data = train_glm)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0757  -0.8344   0.2981   0.8035   2.3837  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -5.97144    0.98206  -6.081 1.20e-09 ***
## alcohol                   0.96938    0.07907  12.260  < 2e-16 ***
## volatile_acidity         -2.75634    0.41634  -6.620 3.58e-11 ***
## Log_sulphates             2.28302    0.35200   6.486 8.82e-11 ***
## Log_total_sulfur_dioxide -0.39120    0.10198  -3.836 0.000125 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1767.4  on 1278  degrees of freedom
## Residual deviance: 1328.6  on 1274  degrees of freedom
## AIC: 1338.6
## 
## Number of Fisher Scoring iterations: 4
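
As a cross-check of this manual variable selection, a stepwise search by AIC can automate the choice. A sketch (not part of the original analysis) using stepAIC() from the already-loaded MASS package:

# stepwise AIC selection starting from the full logistic model
modelo_step <- stepAIC(modelo_glm, direction = "both", trace = FALSE)
formula(modelo_step)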

To interpret the coefficients (as odds ratios):

round(exp(cbind(Estimate = coef(modelo_glm2), confint(modelo_glm2))),
    2)
##                          Estimate 2.5 % 97.5 %
## (Intercept)                  0.00  0.00   0.02
## alcohol                      2.64  2.27   3.09
## volatile_acidity             0.06  0.03   0.14
## Log_sulphates                9.81  4.97  19.79
## Log_total_sulfur_dioxide     0.68  0.55   0.83

The confidence intervals are not based on a Wald test (as in ordinary regression) but on profiling the log-likelihood, which is more accurate.
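
For comparison, Wald-type intervals based on the normal approximation can be obtained with confint.default(); a minimal sketch:

# Wald intervals, to contrast with the profile-likelihood intervals above
round(exp(confint.default(modelo_glm2)), 2)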

Model predictions (by default on the link scale, i.e. log-odds):

head(predict(modelo_glm2))
##          1          2          3          4          5          6 
##  0.1559899 -1.5792806 -1.4338215 -0.6054977 -0.6622377 -1.5725449

Probabilities on the response scale:

head(predict(modelo_glm2, type = "response"))
##         1         2         3         4         5         6 
## 0.5389186 0.1708974 0.1925039 0.3530869 0.3402371 0.1718539

Evaluating the predictive performance of the GLM model on the training data:

train_glm$y_pred_probs <- predict(modelo_glm2, train_glm, type = "response")
train_glm$y_pred <- ifelse(train_glm$y_pred_probs > 0.5, 1, 0)

# train_glm$y_pred_probs train_glm$y_pred
cm_train <- confusionMatrix(as.factor(train_glm$y_pred), as.factor(train_glm$nota_vino),
    positive = "1")
cm_train$table
##           Reference
## Prediction   0   1
##          0 445 172
##          1 152 510
# result
cm_train$overall["Accuracy"] %>%
    round(2)
## Accuracy 
##     0.75
cm_train$byClass["Recall"] %>%
    round(2)
## Recall 
##   0.75
cm_train$byClass["Precision"] %>%
    round(2)
## Precision 
##      0.77

Looking at the metrics obtained, Accuracy (correct predictions / total predictions) stands at 75%, Precision (true positives / (true positives + false positives)) at 77%, and Recall or Sensitivity (true positives / (true positives + false negatives)) at 75%.

From these figures we understand that, in around 75% of cases, the model will correctly predict whether a wine passes because it is reasonably good (quality >= 6) or fails because it is genuinely poor (quality < 6).
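
As a sanity check, the same metrics can be recomputed by hand from the counts in cm_train$table; a minimal sketch:

tp <- 510; tn <- 445; fp <- 152; fn <- 172  # counts from the matrix above
c(accuracy = (tp + tn)/(tp + tn + fp + fn),  # 0.75
    precision = tp/(tp + fp),  # 0.77
    recall = tp/(tp + fn))  # 0.75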

10.1.3. GLM - Cross-validation, hyperparameters and model evaluation

We apply cross-validation to the GLM model and attempt a hyperparameter selection:

First we check which parameters the model exposes for tuning. As can be seen, the GLM model has no hyperparameters to adjust (the single “parameter” listed is just a placeholder).

## https://machinelearningmastery.com/how-to-estimate-model-accuracy-in-r-using-the-caret-package/
## https://daviddalpiaz.github.io/r4sl/the-caret-package.html#classification

# Hyperparameters available for tuning

modelLookup("glm")
##   model parameter     label forReg forClass probModel
## 1   glm parameter parameter   TRUE     TRUE      TRUE

We build the model with the variables selected as relevant, using 5-fold cross-validation on the training dataset.

caret.glm <- train(as.factor(nota_vino) ~ alcohol + volatile_acidity + Log_sulphates + Log_total_sulfur_dioxide, 
                   method = "glm",
                   family = "binomial",
                   data = train_glm,
                   trControl = trainControl(method = "cv", number = 5))
caret.glm
## Generalized Linear Model 
## 
## 1279 samples
##    4 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1024, 1024, 1023, 1022, 1023 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7482917  0.4956541
summary(caret.glm)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0757  -0.8344   0.2981   0.8035   2.3837  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -5.97144    0.98206  -6.081 1.20e-09 ***
## alcohol                   0.96938    0.07907  12.260  < 2e-16 ***
## volatile_acidity         -2.75634    0.41634  -6.620 3.58e-11 ***
## Log_sulphates             2.28302    0.35200   6.486 8.82e-11 ***
## Log_total_sulfur_dioxide -0.39120    0.10198  -3.836 0.000125 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1767.4  on 1278  degrees of freedom
## Residual deviance: 1328.6  on 1274  degrees of freedom
## AIC: 1338.6
## 
## Number of Fisher Scoring iterations: 4

From these figures we understand that, in around 74-75% of cases, the model will correctly predict whether a wine passes because it is reasonably good (quality >= 6) or fails because it is genuinely poor (quality < 6).

confusionMatrix(caret.glm)
## Cross-Validated (5 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    0    1
##          0 35.1 13.6
##          1 11.6 39.7
##                             
##  Accuracy (average) : 0.7482

Evaluating the predictive performance of the GLM model on the training data:

# keep only the probability of class "1" (the two columns sum to 1)
train_glm$y_pred_probs2 <- predict(caret.glm, train_glm, type = "prob")$`1`
train_glm$y_pred2 <- ifelse(train_glm$y_pred_probs2 > 0.5, 1, 0)

# train_glm$y_pred_probs2 train_glm$y_pred2

We reproduce the confusion matrix and the evaluation metrics for the final GLM model obtained:

cm_train2 <- confusionMatrix(as.factor(train_glm$y_pred2), as.factor(train_glm$nota_vino),
    positive = "1")
cm_train2$table
##           Reference
## Prediction   0   1
##          0 445 172
##          1 152 510
# result
cm_train2$overall["Accuracy"] %>%
    round(2)
## Accuracy 
##     0.75
cm_train2$byClass["Recall"] %>%
    round(2)
## Recall 
##   0.75
cm_train2$byClass["Precision"] %>%
    round(2)
## Precision 
##      0.77

We reproduce the ROC curve for the final GLM model obtained:

roc_glm <- plot.roc(as.numeric(train_glm$nota_vino), as.numeric(train_glm$y_pred_probs2))

auc(roc_glm)
## Area under the curve: 0.8195

The area under the curve is around 82%.

10.2. KNN - K Nearest Neighbors

In our training dataset we create the binary variable “nota_vino”, which, based on “quality”, flags wines as passing (quality >= 6) or failing (quality < 6).

train_knn <- train %>%
    mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
    mutate(quality = NULL)
train_knn
## # A tibble: 1,279 x 12
##    fixed_acidity volatile_acidity citric_acid density alcohol    pH
##            <dbl>            <dbl>       <dbl>   <dbl>   <dbl> <dbl>
##  1           7.1            0.48         0.28   0.997    10.3  3.24
##  2           7.6            0.49         0.33   0.997     9    3.3 
##  3           5              1.02         0.04   0.994    10.5  3.75
##  4           7.6            0.43         0.29   0.997     9.5  3.4 
##  5           6.8            0.59         0.1    0.996     9.7  3.3 
##  6           6.8            0.815        0      0.995     9.8  3.3 
##  7           8.5            0.21         0.52   0.996    10.4  3.36
##  8           7.4            0.36         0.29   0.996    11    3.3 
##  9           5.5            0.49         0.03   0.991    14    3.3 
## 10           6.8            0.49         0.22   0.994    11.3  3.41
## # ... with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## #   Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## #   Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_knn$nota_vino)
## 
##   0   1 
## 597 682
str(train_knn)
## tibble [1,279 x 12] (S3: tbl_df/tbl/data.frame)
##  $ fixed_acidity           : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
##  $ volatile_acidity        : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
##  $ citric_acid             : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
##  $ density                 : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
##  $ alcohol                 : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
##  $ pH                      : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
##  $ Log_residual_sugar      : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
##  $ Log_chlorides           : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
##  $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
##  $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
##  $ Log_sulphates           : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
##  $ nota_vino               : num [1:1279] 0 0 0 0 0 0 0 0 1 1 ...
# REFERENCE: https://www.edureka.co/blog/knn-algorithm-in-r/

# train_knn <- train[, colnames(train) != 'quality']
# train_knn$nota_vino <- factor(train$quality < 6, labels =
# c('aprobado','suspenso')) # levels = c('FALSE', 'TRUE')
# train_knn table(train_knn$nota_vino)

# str(train_knn)

First of all we count the number of observations in our training dataset. This gives us a starting point for “K”, the number of neighbours, from which we will later approximate the optimal value of “K”.

n_train <- NROW(train_knn)
n_train
## [1] 1279

To obtain an approximate optimal value of “K”, we take the square root of the total number of observations in the training dataset:

sqrt(n_train)
## [1] 35.76311

We will try 35 and 36 neighbours as a first approximation to the optimal “k” for a knn model.

10.2.1. Building the knn model

To build the knn model we split our training data into training and validation subsets:

numero_total = nrow(train_knn)

w_train = 0.7
w_vali = 0.3

indices = 1:numero_total

indices_train = sample(1:numero_total, numero_total * w_train)
indices_vali = sample(indices[-indices_train], numero_total *
    w_vali)

k_train = train_knn[indices_train, ]
k_train
## # A tibble: 895 x 12
##    fixed_acidity volatile_acidity citric_acid density alcohol    pH
##            <dbl>            <dbl>       <dbl>   <dbl>   <dbl> <dbl>
##  1          10.4             0.44        0.73   0.999   12     3.17
##  2          11.7             0.28        0.47   0.997   10.6   3.15
##  3          10.7             0.4         0.37   0.997   11.2   3.12
##  4           7.9             0.34        0.36   0.994   11.2   3.3 
##  5           6.1             0.58        0.23   0.994   12.5   3.46
##  6          11.5             0.59        0.59   0.999   11     3.18
##  7           6.8             0.59        0.1    0.996    9.7   3.3 
##  8           7.2             0.53        0.13   0.996    9.9   3.21
##  9           7.7             0.75        0.27   0.997    9.3   3.24
## 10           6.9             0.84        0.21   0.998    9.23  3.53
## # ... with 885 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## #   Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## #   Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
k_vali = train_knn[indices_vali, ]

We try a simple k-nearest-neighbours model with K = 35 and K = 36:

knn_simple_35 <- knn(k_train[, 1:11], k_vali[, 1:11], k = 35,
    cl = as.factor(k_train$nota_vino))
knn_simple_36 <- knn(k_train[, 1:11], k_vali[, 1:11], k = 36,
    cl = as.factor(k_train$nota_vino))

table(knn_simple_35, as.factor(k_vali$nota_vino))
##              
## knn_simple_35   0   1
##             0 141  58
##             1  42 142
table(knn_simple_36, as.factor(k_vali$nota_vino))
##              
## knn_simple_36   0   1
##             0 139  57
##             1  44 143
accuracy_knn_simple_35 = sum(knn_simple_35 == k_vali$nota_vino)/nrow(k_vali)
accuracy_knn_simple_36 = sum(knn_simple_36 == k_vali$nota_vino)/nrow(k_vali)

error_knn_simple_35 = 1 - accuracy_knn_simple_35
error_knn_simple_36 = 1 - accuracy_knn_simple_36

accuracy_knn_simple_35
## [1] 0.7389034
error_knn_simple_35
## [1] 0.2610966
print("...........................")
## [1] "..........................."
accuracy_knn_simple_36
## [1] 0.7362924
error_knn_simple_36
## [1] 0.2637076

In both cases the results are very similar, with an accuracy of around 73-74%.

10.2.2. KNN - Cross-validation, hyperparameters and model evaluation

Having seen the base model, we go a step further with a version that (i) normalizes or standardizes the variables so they are treated on an equivalent scale, (ii) tunes the hyperparameter more precisely by choosing the optimal “k” via cross-validation, and (iii) evaluates the model's precision, robustness and ability to generalize.

modelLookup("knn")
##   model parameter      label forReg forClass probModel
## 1   knn         k #Neighbors   TRUE     TRUE      TRUE

We see that the tunable parameter is “k”, the number of nearest neighbours against which each observation is compared; the classification is based on the Euclidean distance between the points.
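
To make that mechanism concrete, the classification of a single validation wine can be sketched by hand (illustrative only, not part of the original pipeline):

# majority vote among the 35 nearest training wines, by Euclidean distance
x0 <- unlist(k_vali[1, 1:11])
dists <- apply(k_train[, 1:11], 1, function(r) sqrt(sum((r - x0)^2)))
nn <- order(dists)[1:35]  # indices of the 35 closest training wines
names(which.max(table(k_train$nota_vino[nn])))  # predicted class, "0" or "1"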

We set up a knn model that includes 5-fold cross-validation, centres and scales the variables, and tunes the hyperparameter k to the optimal number of neighbours.

set.seed(22222220)

default_knn_mod = train(as.factor(nota_vino) ~ ., data = train_knn,
    method = "knn", trControl = trainControl(method = "cv", number = 5),
    preProcess = c("center", "scale"), tuneGrid = expand.grid(k = seq(1,
        101, by = 2)))

default_knn_mod
## k-Nearest Neighbors 
## 
## 1279 samples
##   11 predictor
##    2 classes: '0', '1' 
## 
## Pre-processing: centered (11), scaled (11) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1023, 1023, 1023, 1023, 1024 
## Resampling results across tuning parameters:
## 
##   k    Accuracy   Kappa    
##     1  0.7529259  0.5024638
##     3  0.7271293  0.4496676
##     5  0.7451256  0.4857637
##     7  0.7279075  0.4514521
##     9  0.7372702  0.4712542
##    11  0.7497855  0.4961405
##    13  0.7474357  0.4921884
##    15  0.7443045  0.4863425
##    17  0.7419700  0.4817875
##    19  0.7403860  0.4788672
##    21  0.7403922  0.4788520
##    23  0.7482108  0.4949287
##    25  0.7443076  0.4870466
##    27  0.7435202  0.4857680
##    29  0.7404013  0.4793634
##    31  0.7443107  0.4873547
##    33  0.7443168  0.4876106
##    35  0.7466575  0.4922032
##    37  0.7474357  0.4937484
##    39  0.7474387  0.4936762
##    41  0.7435294  0.4852850
##    43  0.7419669  0.4824611
##    45  0.7396201  0.4774906
##    47  0.7458732  0.4903228
##    49  0.7435263  0.4853138
##    51  0.7419669  0.4826133
##    53  0.7505576  0.4996890
##    55  0.7505607  0.4998633
##    57  0.7497794  0.4980862
##    59  0.7536918  0.5055618
##    61  0.7560386  0.5102242
##    63  0.7560417  0.5104059
##    65  0.7583885  0.5153964
##    67  0.7544761  0.5079165
##    69  0.7599510  0.5183972
##    71  0.7599540  0.5181401
##    73  0.7560355  0.5106282
##    75  0.7536918  0.5056467
##    77  0.7576011  0.5138357
##    79  0.7560509  0.5110409
##    81  0.7544761  0.5080859
##    83  0.7552574  0.5095624
##    85  0.7552543  0.5096503
##    87  0.7529136  0.5050209
##    89  0.7576072  0.5145375
##    91  0.7599540  0.5192917
##    93  0.7552635  0.5098494
##    95  0.7552727  0.5096173
##    97  0.7591850  0.5178204
##    99  0.7607445  0.5210021
##   101  0.7583946  0.5161423
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 99.
summary(default_knn_mod)
##             Length Class      Mode     
## learn        2     -none-     list     
## k            1     -none-     numeric  
## theDots      0     -none-     list     
## xNames      11     -none-     character
## problemType  1     -none-     character
## tuneValue    1     data.frame list     
## obsLevels    2     -none-     character
## param        0     -none-     list

We obtain an optimal k of 99 neighbours, which gives an accuracy of 76.07%, improving on the previous results.

get_best_result = function(caret_fit) {
    best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
    best_result = caret_fit$results[best, ]
    rownames(best_result) = NULL
    best_result
}

get_best_result(default_knn_mod)
##    k  Accuracy     Kappa AccuracySD    KappaSD
## 1 99 0.7607445 0.5210021 0.02308807 0.04514979

We fit the model with the optimal value of k (k = 99):

set.seed(22222220)

caret.knn = train(as.factor(nota_vino) ~ ., data = train_knn,
    method = "knn", trControl = trainControl(method = "cv", number = 5),
    preProcess = c("center", "scale"), tuneGrid = data.frame(k = 99))

caret.knn
## k-Nearest Neighbors 
## 
## 1279 samples
##   11 predictor
##    2 classes: '0', '1' 
## 
## Pre-processing: centered (11), scaled (11) 
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1023, 1023, 1023, 1023, 1024 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7599632  0.5193943
## 
## Tuning parameter 'k' was held constant at a value of 99
summary(caret.knn)
##             Length Class      Mode     
## learn        2     -none-     list     
## k            1     -none-     numeric  
## theDots      0     -none-     list     
## xNames      11     -none-     character
## problemType  1     -none-     character
## tuneValue    1     data.frame list     
## obsLevels    2     -none-     character
## param        0     -none-     list

From these figures we understand that, in around 76-77% of cases, the model will correctly predict whether a wine passes because it is reasonably good (quality >= 6) or fails because it is genuinely poor (quality < 6).

Evaluating the predictive performance of the KNN model on the training data:

# keep only the probability of class "1" (the two columns sum to 1)
train_knn$y_pred_probs2 <- predict(caret.knn, newdata = train_knn,
    type = "prob")$`1`

train_knn$y_pred2 <- ifelse(train_knn$y_pred_probs2 > 0.5, 1, 0)

# train_knn$y_pred_probs2 train_knn$y_pred2

We reproduce the confusion matrix and the evaluation metrics for the final KNN model obtained:

cm_train_knn <- confusionMatrix(as.factor(train_knn$y_pred2),
    as.factor(train_knn$nota_vino), positive = "1")
cm_train_knn$table
##           Reference
## Prediction   0   1
##          0 463 164
##          1 134 518
# result
cm_train_knn$overall["Accuracy"] %>%
    round(2)
## Accuracy 
##     0.77
cm_train_knn$byClass["Recall"] %>%
    round(2)
## Recall 
##   0.76
cm_train_knn$byClass["Precision"] %>%
    round(2)
## Precision 
##      0.79

We reproduce the ROC curve for the final KNN model obtained:

roc_knn <- plot.roc(as.numeric(train_knn$nota_vino), as.numeric(train_knn$y_pred_probs2))

auc(roc_knn)
## Area under the curve: 0.8268

The area under the curve is around 82-83%.
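
The GLM and KNN curves can also be compared formally; a sketch using pROC's DeLong test on the two roc objects built above:

# DeLong's test for two correlated ROC curves (same training responses)
roc.test(roc_glm, roc_knn)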

10.3. DECISION TREE

In our training dataset we create the binary variable “nota_vino”, which, based on “quality”, flags wines as passing (quality >= 6) or failing (quality < 6).

# train_tree <- train[, colnames(train)!='quality']
# train_tree$nota_vino <- factor(train$quality < 6, labels
# = c('aprobado', 'suspenso')) train_tree str(train_tree)
train_tree <- train %>%
    mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
    mutate(quality = NULL)
train_tree
## # A tibble: 1,279 x 12
##    fixed_acidity volatile_acidity citric_acid density alcohol    pH
##            <dbl>            <dbl>       <dbl>   <dbl>   <dbl> <dbl>
##  1           7.1            0.48         0.28   0.997    10.3  3.24
##  2           7.6            0.49         0.33   0.997     9    3.3 
##  3           5              1.02         0.04   0.994    10.5  3.75
##  4           7.6            0.43         0.29   0.997     9.5  3.4 
##  5           6.8            0.59         0.1    0.996     9.7  3.3 
##  6           6.8            0.815        0      0.995     9.8  3.3 
##  7           8.5            0.21         0.52   0.996    10.4  3.36
##  8           7.4            0.36         0.29   0.996    11    3.3 
##  9           5.5            0.49         0.03   0.991    14    3.3 
## 10           6.8            0.49         0.22   0.994    11.3  3.41
## # ... with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## #   Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## #   Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_tree$nota_vino)
## 
##   0   1 
## 597 682
str(train_tree)
## tibble [1,279 x 12] (S3: tbl_df/tbl/data.frame)
##  $ fixed_acidity           : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
##  $ volatile_acidity        : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
##  $ citric_acid             : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
##  $ density                 : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
##  $ alcohol                 : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
##  $ pH                      : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
##  $ Log_residual_sugar      : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
##  $ Log_chlorides           : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
##  $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
##  $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
##  $ Log_sulphates           : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
##  $ nota_vino               : num [1:1279] 0 0 0 0 0 0 0 0 1 1 ...

We build an initial, basic, unpruned decision tree model:

# classification tree with the default split criterion ("gini"); we lower the
# complexity parameter from its default (cp = 0.01) to cp = 0.006 to allow a deeper tree
tree = rpart(as.factor(nota_vino) ~ ., data = train_tree, cp = 0.006)
rpart.plot(tree, nn = TRUE, extra = 104, box.palette = "GnBu", branch.lty = 3, shadow.col = "gray")

tree
## n= 1279 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 1279 597 1 (0.4667709 0.5332291)  
##    2) alcohol< 10.525 779 280 0 (0.6405648 0.3594352)  
##      4) Log_sulphates< -0.4700356 494 129 0 (0.7388664 0.2611336) *
##      5) Log_sulphates>=-0.4700356 285 134 1 (0.4701754 0.5298246)  
##       10) volatile_acidity>=0.545 103  33 0 (0.6796117 0.3203883)  
##         20) Log_chlorides>=-2.333097 36   4 0 (0.8888889 0.1111111) *
##         21) Log_chlorides< -2.333097 67  29 0 (0.5671642 0.4328358)  
##           42) fixed_acidity< 9.55 59  21 0 (0.6440678 0.3559322) *
##           43) fixed_acidity>=9.55 8   0 1 (0.0000000 1.0000000) *
##       11) volatile_acidity< 0.545 182  64 1 (0.3516484 0.6483516)  
##         22) Log_total_sulfur_dioxide>=4.166635 42  17 0 (0.5952381 0.4047619) *
##         23) Log_total_sulfur_dioxide< 4.166635 140  39 1 (0.2785714 0.7214286) *
##    3) alcohol>=10.525 500  98 1 (0.1960000 0.8040000)  
##      6) volatile_acidity>=0.87 24   6 0 (0.7500000 0.2500000) *
##      7) volatile_acidity< 0.87 476  80 1 (0.1680672 0.8319328) *

We examine the resulting rules numerically:

rpart.rules(tree, style = "tall")
## as.factor(nota_vino) is 0.11 when
##     alcohol < 11
##     volatile_acidity >= 0.55
##     Log_sulphates >= -0.47
##     Log_chlorides >= -2.3
## 
## as.factor(nota_vino) is 0.25 when
##     alcohol >= 11
##     volatile_acidity >= 0.87
## 
## as.factor(nota_vino) is 0.26 when
##     alcohol < 11
##     Log_sulphates < -0.47
## 
## as.factor(nota_vino) is 0.36 when
##     alcohol < 11
##     volatile_acidity >= 0.55
##     Log_sulphates >= -0.47
##     Log_chlorides < -2.3
##     fixed_acidity < 9.6
## 
## as.factor(nota_vino) is 0.40 when
##     alcohol < 11
##     volatile_acidity < 0.55
##     Log_sulphates >= -0.47
##     Log_total_sulfur_dioxide >= 4.2
## 
## as.factor(nota_vino) is 0.72 when
##     alcohol < 11
##     volatile_acidity < 0.55
##     Log_sulphates >= -0.47
##     Log_total_sulfur_dioxide < 4.2
## 
## as.factor(nota_vino) is 0.83 when
##     alcohol >= 11
##     volatile_acidity < 0.87
## 
## as.factor(nota_vino) is 1.00 when
##     alcohol < 11
##     volatile_acidity >= 0.55
##     Log_sulphates >= -0.47
##     Log_chlorides < -2.3
##     fixed_acidity >= 9.6
  • Unpruned Decision Tree model:
obs_tree1 <- as.factor(train_tree$nota_vino)
head(predict(tree, newdata = train_tree))
##           0         1
## 1 0.7388664 0.2611336
## 2 0.7388664 0.2611336
## 3 0.7388664 0.2611336
## 4 0.5952381 0.4047619
## 5 0.6440678 0.3559322
## 6 0.7388664 0.2611336
pred_tree1 <- predict(tree, newdata = train_tree, type = "class")
table(obs_tree1, pred_tree1)
##          pred_tree1
## obs_tree1   0   1
##         0 478 119
##         1 177 505
caret::confusionMatrix(pred_tree1, obs_tree1)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 478 177
##          1 119 505
##                                           
##                Accuracy : 0.7686          
##                  95% CI : (0.7445, 0.7914)
##     No Information Rate : 0.5332          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5379          
##                                           
##  Mcnemar's Test P-Value : 0.0009228       
##                                           
##             Sensitivity : 0.8007          
##             Specificity : 0.7405          
##          Pos Pred Value : 0.7298          
##          Neg Pred Value : 0.8093          
##              Prevalence : 0.4668          
##          Detection Rate : 0.3737          
##    Detection Prevalence : 0.5121          
##       Balanced Accuracy : 0.7706          
##                                           
##        'Positive' Class : 0               
## 

We obtain a model accuracy of 76.86%, with the drawback that an unpruned model like this is overly complex and prone to overfitting.

We assess whether pruning can simplify the model and make it more interpretable without losing predictive power. To do so we inspect the CP, or “complexity parameter”, looking for the shallowest tree that also minimizes the cross-validated error rate.

plotcp(tree)  # CP (complexity parameter): look for the shallowest tree that also minimizes the error rate

printcp(tree)
## 
## Classification tree:
## rpart(formula = as.factor(nota_vino) ~ ., data = train_tree, 
##     cp = 0.006)
## 
## Variables actually used in tree construction:
## [1] alcohol                  fixed_acidity            Log_chlorides           
## [4] Log_sulphates            Log_total_sulfur_dioxide volatile_acidity        
## 
## Root node error: 597/1279 = 0.46677
## 
## n= 1279 
## 
##          CP nsplit rel error  xerror     xstd
## 1 0.3668342      0   1.00000 1.00000 0.029886
## 2 0.0452261      1   0.63317 0.64154 0.027437
## 3 0.0201005      3   0.54271 0.58291 0.026660
## 4 0.0134003      4   0.52261 0.58961 0.026755
## 5 0.0067002      5   0.50921 0.57621 0.026563
## 6 0.0060000      7   0.49581 0.58291 0.026660

Finally, we decide to prune the tree and build a simpler alternative model:

# 1-SE rule: pick the smallest tree whose cross-validated error (xerror)
# is within one standard error of the minimum
xerror <- tree$cptable[, "xerror"]
imin.xerror <- which.min(xerror)
upper.xerror <- xerror[imin.xerror] + tree$cptable[imin.xerror,
    "xstd"]
icp <- min(which(xerror <= upper.xerror))
cp <- tree$cptable[icp, "CP"]
cp
## [1] 0.0201005
tree_2 <- prune(tree, cp = cp)
rpart.plot(tree_2, nn = TRUE, extra = 104, box.palette = "GnBu",
    branch.lty = 3, shadow.col = "gray")

  • Pruned Decision Tree model:
obs_tree2 <- as.factor(train_tree$nota_vino)
head(predict(tree_2, newdata = train_tree))
##           0         1
## 1 0.7388664 0.2611336
## 2 0.7388664 0.2611336
## 3 0.7388664 0.2611336
## 4 0.3516484 0.6483516
## 5 0.6796117 0.3203883
## 6 0.7388664 0.2611336
pred_tree2 <- predict(tree_2, newdata = train_tree, type = "class")
table(obs_tree2, pred_tree2)
##          pred_tree2
## obs_tree2   0   1
##         0 435 162
##         1 162 520
caret::confusionMatrix(pred_tree2, obs_tree2)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 435 162
##          1 162 520
##                                           
##                Accuracy : 0.7467          
##                  95% CI : (0.7219, 0.7703)
##     No Information Rate : 0.5332          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.4911          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.7286          
##             Specificity : 0.7625          
##          Pos Pred Value : 0.7286          
##          Neg Pred Value : 0.7625          
##              Prevalence : 0.4668          
##          Detection Rate : 0.3401          
##    Detection Prevalence : 0.4668          
##       Balanced Accuracy : 0.7456          
##                                           
##        'Positive' Class : 0               
## 

Pruning yields a cleaner, simpler, more interpretable model that should generalize better to other datasets, avoiding potential overfitting at the cost of only a small drop in predictive accuracy, to 74.67%. We consider this pruned model the optimal one in this case.
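
As a quick numeric check of the simplification (a small sketch, not shown in the original output), the number of nodes in each rpart object can be compared directly:

# each row of $frame is a node of the tree (sketch only)
nrow(tree$frame)    # unpruned tree
nrow(tree_2$frame)  # pruned tree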

10.3.1. DECISION TREE - Cross-Validation, Hyperparameters and Model Evaluation

To obtain the best possible model we perform 5-fold cross-validation and tune the hyperparameter (the optimal cp for a validated model). We also restrict the predictors to the variables we have seen to be most representative and explanatory of the response “nota_vino”.

# Fit the model on the training set
set.seed(1234)
caret.tree <- train(as.factor(nota_vino) ~ alcohol + volatile_acidity +
    Log_sulphates + Log_total_sulfur_dioxide, data = train_tree,
    method = "rpart", trControl = trainControl("cv", number = 5),
    tuneLength = 20)
# Plot model accuracy vs different values of cp (complexity
# parameter)
plot(caret.tree)

caret.tree
## CART 
## 
## 1279 samples
##    4 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1022, 1024, 1023, 1023, 1024 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.00000000  0.7239386  0.4455300
##   0.01930706  0.7255101  0.4484141
##   0.03861412  0.7215885  0.4421371
##   0.05792118  0.7004855  0.4069132
##   0.07722825  0.7004855  0.4069132
##   0.09653531  0.7004855  0.4069132
##   0.11584237  0.7004855  0.4069132
##   0.13514943  0.7004855  0.4069132
##   0.15445649  0.7004855  0.4069132
##   0.17376355  0.7004855  0.4069132
##   0.19307062  0.7004855  0.4069132
##   0.21237768  0.7004855  0.4069132
##   0.23168474  0.7004855  0.4069132
##   0.25099180  0.7004855  0.4069132
##   0.27029886  0.7004855  0.4069132
##   0.28960592  0.7004855  0.4069132
##   0.30891299  0.7004855  0.4069132
##   0.32822005  0.7004855  0.4069132
##   0.34752711  0.7004855  0.4069132
##   0.36683417  0.6614230  0.3137864
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01930706.
caret.tree$bestTune
##           cp
## 2 0.01930706

Cross-validation shows that the optimal CP for our decision tree model is 0.01930706.

We plot the resulting tree:

# Plot the final tree model
par(xpd = NA) # Avoid clipping the text in some device
plot(caret.tree$finalModel,uniform=TRUE)
text(caret.tree$finalModel,  digits = 10)

# helper: extract the row of $results corresponding to the tuned model ($bestTune)
get_best_result = function(caret_fit) {
    best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
    best_result = caret_fit$results[best, ]
    rownames(best_result) = NULL
    best_result
}

get_best_result(caret.tree)
##           cp  Accuracy     Kappa AccuracySD    KappaSD
## 1 0.01930706 0.7255101 0.4484141 0.02542086 0.05203991

With cross-validation we finally obtain an accuracy of roughly 72-73%, from a model that has been validated as robust and can be expected to generalize to a different dataset.

Evaluation of the predictive performance of this Decision Tree model on the training data:

# probability of the positive class '1' and the predicted class at a 0.5 threshold
train_tree$y_pred_probs2 <- predict(caret.tree, newdata = train_tree,
    type = "prob")
train_tree$y_pred_probs2 <- ifelse(train_tree$y_pred_probs2$`1` >
    0.5, train_tree$y_pred_probs2$`1`, 1 - train_tree$y_pred_probs2$`0`)

train_tree$y_pred2 <- ifelse(train_tree$y_pred_probs2 > 0.5,
    1, 0)

We produce the confusion matrix and the evaluation metrics for the final Decision Tree model:

cm_train_tree <- confusionMatrix(as.factor(train_tree$y_pred2),
    as.factor(train_tree$nota_vino), positive = "1")
cm_train_tree$table
##           Reference
## Prediction   0   1
##          0 453 168
##          1 144 514
# result
cm_train_tree$overall["Accuracy"] %>%
    round(2)
## Accuracy 
##     0.76
cm_train_tree$byClass["Recall"] %>%
    round(2)
## Recall 
##   0.75
cm_train_tree$byClass["Precision"] %>%
    round(2)
## Precision 
##      0.78

We produce the ROC curve for the final Decision Tree model:

roc_tree <- plot.roc(as.numeric(train_tree$nota_vino), as.numeric(train_tree$y_pred_probs2))

auc(roc_tree)
## Area under the curve: 0.7798

We obtain an area under the curve of roughly 78%.

10.4. RANDOM FOREST

In our training dataset we create the binary variable “nota_vino”, which, based on “quality”, flags wines with passing scores (quality >= 6) or failing scores (quality < 6).

train_forest <- train %>%
    mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
    mutate(quality = NULL)
train_forest
## # A tibble: 1,279 x 12
##    fixed_acidity volatile_acidity citric_acid density alcohol    pH
##            <dbl>            <dbl>       <dbl>   <dbl>   <dbl> <dbl>
##  1           7.1            0.48         0.28   0.997    10.3  3.24
##  2           7.6            0.49         0.33   0.997     9    3.3 
##  3           5              1.02         0.04   0.994    10.5  3.75
##  4           7.6            0.43         0.29   0.997     9.5  3.4 
##  5           6.8            0.59         0.1    0.996     9.7  3.3 
##  6           6.8            0.815        0      0.995     9.8  3.3 
##  7           8.5            0.21         0.52   0.996    10.4  3.36
##  8           7.4            0.36         0.29   0.996    11    3.3 
##  9           5.5            0.49         0.03   0.991    14    3.3 
## 10           6.8            0.49         0.22   0.994    11.3  3.41
## # ... with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## #   Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## #   Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_forest$nota_vino)
## 
##   0   1 
## 597 682
str(train_forest)
## tibble [1,279 x 12] (S3: tbl_df/tbl/data.frame)
##  $ fixed_acidity           : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
##  $ volatile_acidity        : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
##  $ citric_acid             : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
##  $ density                 : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
##  $ alcohol                 : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
##  $ pH                      : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
##  $ Log_residual_sugar      : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
##  $ Log_chlorides           : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
##  $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
##  $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
##  $ Log_sulphates           : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
##  $ nota_vino               : num [1:1279] 0 0 0 0 0 0 0 0 1 1 ...

We build the baseline random forest model:

set.seed(4343)
rf <- randomForest(as.factor(nota_vino) ~ ., data = train_forest,
    ntree = 200, do.trace = T, importance = T)
## ntree      OOB      1      2
##     1:  29.89% 33.78% 26.40%
##     2:  30.76% 34.04% 27.58%
##     3:  30.18% 30.62% 29.76%
##     4:  28.32% 28.93% 27.78%
##     5:  27.30% 27.79% 26.85%
##     6:  27.06% 27.22% 26.92%
##     7:  27.06% 25.52% 28.42%
##     8:  26.05% 26.16% 25.94%
##     9:  24.38% 24.53% 24.25%
##    10:  25.18% 25.38% 25.00%
##    11:  23.62% 24.66% 22.70%
##    12:  24.19% 25.00% 23.49%
##    13:  23.64% 23.91% 23.42%
##    14:  22.55% 22.56% 22.53%
##    15:  22.21% 23.23% 21.32%
##    16:  23.82% 24.16% 23.53%
##    17:  23.00% 23.62% 22.47%
##    18:  22.14% 22.45% 21.88%
##    19:  22.22% 22.11% 22.32%
##    20:  21.21% 20.77% 21.59%
##    21:  21.60% 21.61% 21.59%
##    22:  21.50% 21.44% 21.55%
##    23:  21.19% 21.61% 20.82%
##    24:  21.27% 21.11% 21.41%
##    25:  21.11% 21.11% 21.11%
##    26:  20.88% 21.44% 20.38%
##    27:  21.19% 21.78% 20.67%
##    28:  20.64% 20.77% 20.53%
##    29:  21.27% 21.61% 20.97%
##    30:  20.88% 21.78% 20.09%
##    31:  21.03% 20.94% 21.11%
##    32:  21.03% 21.94% 20.23%
##    33:  20.02% 20.60% 19.50%
##    34:  20.95% 21.44% 20.53%
##    35:  20.56% 21.27% 19.94%
##    36:  21.03% 21.94% 20.23%
##    37:  20.64% 20.94% 20.38%
##    38:  21.03% 21.11% 20.97%
##    39:  20.95% 21.11% 20.82%
##    40:  20.95% 20.94% 20.97%
##    41:  20.17% 20.77% 19.65%
##    42:  20.41% 20.60% 20.23%
##    43:  20.09% 20.44% 19.79%
##    44:  20.64% 20.94% 20.38%
##    45:  20.25% 20.77% 19.79%
##    46:  20.56% 21.11% 20.09%
##    47:  20.25% 20.77% 19.79%
##    48:  20.09% 20.44% 19.79%
##    49:  19.47% 20.27% 18.77%
##    50:  19.62% 20.10% 19.21%
##    51:  19.62% 19.93% 19.35%
##    52:  19.62% 20.44% 18.91%
##    53:  19.62% 20.60% 18.77%
##    54:  19.86% 20.44% 19.35%
##    55:  19.62% 20.77% 18.62%
##    56:  20.09% 21.27% 19.06%
##    57:  19.62% 20.60% 18.77%
##    58:  19.70% 20.10% 19.35%
##    59:  20.25% 21.11% 19.50%
##    60:  19.78% 20.27% 19.35%
##    61:  19.47% 19.77% 19.21%
##    62:  19.31% 19.93% 18.77%
##    63:  19.23% 19.43% 19.06%
##    64:  19.31% 20.10% 18.62%
##    65:  19.70% 20.10% 19.35%
##    66:  19.39% 20.10% 18.77%
##    67:  19.62% 20.44% 18.91%
##    68:  19.70% 20.27% 19.21%
##    69:  19.78% 20.60% 19.06%
##    70:  19.55% 19.93% 19.21%
##    71:  19.78% 20.10% 19.50%
##    72:  20.41% 20.94% 19.94%
##    73:  19.86% 20.44% 19.35%
##    74:  19.78% 20.94% 18.77%
##    75:  20.09% 20.94% 19.35%
##    76:  20.09% 20.77% 19.50%
##    77:  19.70% 20.44% 19.06%
##    78:  19.78% 20.44% 19.21%
##    79:  19.70% 20.44% 19.06%
##    80:  19.86% 20.44% 19.35%
##    81:  19.23% 20.10% 18.48%
##    82:  19.78% 20.60% 19.06%
##    83:  19.78% 20.60% 19.06%
##    84:  19.86% 20.60% 19.21%
##    85:  19.78% 20.44% 19.21%
##    86:  19.62% 20.44% 18.91%
##    87:  19.55% 20.27% 18.91%
##    88:  19.94% 20.44% 19.50%
##    89:  20.02% 20.44% 19.65%
##    90:  19.55% 20.10% 19.06%
##    91:  19.86% 21.11% 18.77%
##    92:  19.78% 20.44% 19.21%
##    93:  19.55% 19.93% 19.21%
##    94:  19.94% 20.77% 19.21%
##    95:  20.17% 20.94% 19.50%
##    96:  19.78% 20.77% 18.91%
##    97:  19.39% 20.60% 18.33%
##    98:  19.55% 20.60% 18.62%
##    99:  19.23% 20.10% 18.48%
##   100:  19.31% 20.10% 18.62%
##   101:  19.47% 20.44% 18.62%
##   102:  19.47% 20.10% 18.91%
##   103:  19.55% 19.93% 19.21%
##   104:  19.47% 19.93% 19.06%
##   105:  19.55% 19.93% 19.21%
##   106:  19.70% 19.93% 19.50%
##   107:  19.39% 19.77% 19.06%
##   108:  19.70% 20.27% 19.21%
##   109:  19.55% 19.77% 19.35%
##   110:  19.78% 20.44% 19.21%
##   111:  19.55% 20.60% 18.62%
##   112:  19.62% 20.27% 19.06%
##   113:  19.62% 20.27% 19.06%
##   114:  19.78% 20.27% 19.35%
##   115:  19.86% 20.27% 19.50%
##   116:  19.78% 20.10% 19.50%
##   117:  19.62% 19.60% 19.65%
##   118:  19.39% 19.43% 19.35%
##   119:  19.62% 19.77% 19.50%
##   120:  19.86% 20.10% 19.65%
##   121:  19.86% 20.10% 19.65%
##   122:  19.94% 20.44% 19.50%
##   123:  19.86% 20.10% 19.65%
##   124:  19.47% 19.93% 19.06%
##   125:  19.86% 20.27% 19.50%
##   126:  19.62% 20.10% 19.21%
##   127:  19.86% 19.93% 19.79%
##   128:  19.62% 20.27% 19.06%
##   129:  20.17% 20.77% 19.65%
##   130:  19.86% 20.44% 19.35%
##   131:  19.62% 20.60% 18.77%
##   132:  19.55% 20.27% 18.91%
##   133:  19.70% 20.10% 19.35%
##   134:  19.47% 20.27% 18.77%
##   135:  19.70% 20.27% 19.21%
##   136:  19.70% 20.60% 18.91%
##   137:  19.62% 20.77% 18.62%
##   138:  19.62% 20.27% 19.06%
##   139:  19.70% 20.77% 18.77%
##   140:  20.02% 20.94% 19.21%
##   141:  20.17% 21.11% 19.35%
##   142:  19.94% 20.60% 19.35%
##   143:  19.62% 20.60% 18.77%
##   144:  19.55% 20.44% 18.77%
##   145:  19.86% 20.77% 19.06%
##   146:  19.86% 20.60% 19.21%
##   147:  19.62% 19.93% 19.35%
##   148:  20.02% 20.60% 19.50%
##   149:  19.39% 19.60% 19.21%
##   150:  19.70% 19.93% 19.50%
##   151:  19.55% 19.60% 19.50%
##   152:  19.62% 19.77% 19.50%
##   153:  19.47% 19.77% 19.21%
##   154:  19.16% 19.60% 18.77%
##   155:  19.23% 19.60% 18.91%
##   156:  19.70% 20.10% 19.35%
##   157:  19.39% 19.60% 19.21%
##   158:  19.47% 19.60% 19.35%
##   159:  19.55% 19.93% 19.21%
##   160:  19.70% 20.10% 19.35%
##   161:  19.78% 20.27% 19.35%
##   162:  19.70% 20.10% 19.35%
##   163:  19.86% 20.27% 19.50%
##   164:  19.94% 20.77% 19.21%
##   165:  20.02% 20.77% 19.35%
##   166:  20.09% 20.77% 19.50%
##   167:  20.33% 21.11% 19.65%
##   168:  20.09% 20.94% 19.35%
##   169:  20.09% 20.60% 19.65%
##   170:  20.09% 20.94% 19.35%
##   171:  20.17% 20.77% 19.65%
##   172:  19.86% 20.44% 19.35%
##   173:  19.78% 20.27% 19.35%
##   174:  19.70% 20.27% 19.21%
##   175:  19.70% 20.27% 19.21%
##   176:  19.55% 20.10% 19.06%
##   177:  19.47% 19.93% 19.06%
##   178:  19.55% 20.10% 19.06%
##   179:  19.39% 19.93% 18.91%
##   180:  19.62% 20.44% 18.91%
##   181:  19.78% 20.44% 19.21%
##   182:  19.78% 20.44% 19.21%
##   183:  19.70% 20.27% 19.21%
##   184:  19.86% 20.27% 19.50%
##   185:  19.70% 20.27% 19.21%
##   186:  19.78% 20.10% 19.50%
##   187:  20.02% 20.27% 19.79%
##   188:  20.02% 20.60% 19.50%
##   189:  20.09% 20.60% 19.65%
##   190:  19.94% 20.27% 19.65%
##   191:  19.86% 20.44% 19.35%
##   192:  20.02% 20.44% 19.65%
##   193:  19.94% 20.27% 19.65%
##   194:  19.94% 20.44% 19.50%
##   195:  19.78% 20.10% 19.50%
##   196:  19.94% 20.44% 19.50%
##   197:  19.78% 20.44% 19.21%
##   198:  19.86% 20.44% 19.35%
##   199:  19.62% 20.44% 18.91%
##   200:  19.94% 20.44% 19.50%
rf
## 
## Call:
##  randomForest(formula = as.factor(nota_vino) ~ ., data = train_forest,      ntree = 200, do.trace = T, importance = T) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 19.94%
## Confusion matrix:
##     0   1 class.error
## 0 475 122   0.2043551
## 1 133 549   0.1950147

We examine how the error converges as more trees are added:

plot(rf,main="")
legend("right", colnames(rf$err.rate), lty = 1:5, col = 1:6)

We look at variable importance in the model (the key variable with the greatest effect on model accuracy is “alcohol”):

varImpPlot(rf)
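
For a numeric view of the same information (a small sketch; the original report shows only the plot), the importance matrix can be extracted from the fitted object and sorted:

# mean decrease in accuracy per variable, sorted (sketch only)
imp <- importance(rf, type = 1)
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]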

10.4.1. RANDOM FOREST - Cross-Validation, Hyperparameters and Model Evaluation

We see that the main parameter to tune is the number of predictors the model samples at random at each split.

modelLookup("rf")
##   model parameter                         label forReg forClass probModel
## 1    rf      mtry #Randomly Selected Predictors   TRUE     TRUE      TRUE

We build a model applying cross-validation and tuning hyperparameters (mtry, the number of trees, and the node size, which controls tree depth) so as to obtain a robust, generalizable model. We take as predictors the four most relevant variables observed above:

# Fit the model on the training set
set.seed(12345)

caret.rf <- train(as.factor(nota_vino) ~ alcohol + volatile_acidity +
    Log_sulphates + Log_total_sulfur_dioxide, data = train_forest,
    method = "rf", ntree = 200, importance = TRUE, metric = "Accuracy",
    trControl = trainControl("cv", number = 5, search = "grid"),
    nodesize = 50, tuneLength = 10)
## note: only 3 unique complexity parameters in default grid. Truncating the grid to 3 .
plot(caret.rf)

caret.rf
## Random Forest 
## 
## 1279 samples
##    4 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1023, 1022, 1023, 1024, 1024 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.7654185  0.5293747
##   3     0.7583872  0.5158127
##   4     0.7552408  0.5096077
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
caret.rf$bestTune
##   mtry
## 1    2
# reuse the get_best_result() helper defined in section 10.3.1

get_best_result(caret.rf)
##   mtry  Accuracy     Kappa AccuracySD    KappaSD
## 1    2 0.7654185 0.5293747 0.02794045 0.05625007

Evaluation of the predictive performance of this Random Forest model on the training data:

train_forest$y_pred_probs2 <- predict(caret.rf, newdata = train_forest,
    type = "prob")
train_forest$y_pred_probs2 <- ifelse(train_forest$y_pred_probs2$`1` >
    0.5, train_forest$y_pred_probs2$`1`, 1 - train_forest$y_pred_probs2$`0`)

train_forest$y_pred2 <- ifelse(train_forest$y_pred_probs2 > 0.5,
    1, 0)

# train_forest$y_pred_probs2
train_forest$y_pred2
##    [1] 0 0 0 0 0 0 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1 0 0 1
##   [38] 0 1 0 1 0 1 0 1 1 1 0 0 0 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 0 0 1 0 1 1 1 0 0
##   [75] 0 1 0 1 1 0 0 1 1 1 0 0 0 0 0 1 0 1 1 1 0 0 1 1 1 1 0 1 0 0 0 1 1 0 0 1 1
##  [112] 1 0 0 0 0 1 0 0 0 0 1 0 0 1 1 0 0 1 0 1 0 1 1 0 1 0 1 1 1 0 1 1 0 1 1 0 0
##  [149] 1 1 0 0 1 1 0 0 1 1 1 1 1 0 0 0 1 0 0 1 0 0 0 1 1 1 0 1 0 0 1 1 0 0 1 1 1
##  [186] 1 1 1 1 0 0 1 0 1 0 1 1 1 0 1 0 1 0 1 0 1 0 1 0 0 0 1 0 1 1 1 1 1 0 1 1 1
##  [223] 0 0 1 0 0 1 1 1 1 1 1 0 0 0 0 1 1 1 0 1 0 1 1 0 0 1 1 0 1 1 1 1 0 1 0 1 1
##  [260] 0 0 1 0 0 0 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1 1 1 0 1 0 0 0 1 1 1 0 1 0 1
##  [297] 1 1 0 0 1 0 0 1 1 1 1 1 0 1 0 1 0 0 1 1 0 1 1 0 0 0 0 1 0 0 0 1 0 1 1 0 0
##  [334] 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1 0 0 0 0 1 1 1 0 1 0 0 0 1 1 1 0 1 0 1 1 1 0
##  [371] 0 0 1 0 1 0 0 1 1 0 0 0 0 1 1 1 1 0 1 1 1 1 0 1 0 0 0 1 0 0 1 1 1 0 1 1 1
##  [408] 1 1 0 0 0 0 0 0 0 0 1 1 1 0 1 1 0 1 0 0 0 0 0 1 1 1 1 1 0 0 1 0 0 1 0 1 1
##  [445] 1 1 1 0 1 1 1 1 0 1 0 1 0 0 1 0 1 1 0 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 0
##  [482] 0 0 1 0 0 1 1 0 0 0 1 1 0 1 0 0 1 0 1 0 1 1 0 0 1 1 0 0 0 1 1 1 1 0 1 0 1
##  [519] 0 1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1
##  [556] 1 0 1 1 0 1 0 1 0 0 1 0 1 0 1 1 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 1 0 0 1
##  [593] 0 0 1 1 0 0 0 0 1 1 1 1 0 1 1 0 1 0 0 1 1 1 0 0 0 1 1 0 1 1 0 0 1 0 0 0 1
##  [630] 1 1 0 1 1 1 1 0 1 0 1 1 0 0 1 0 1 1 0 1 1 0 1 0 0 0 1 1 1 0 0 1 1 1 1 0 1
##  [667] 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 1 0 0 1 0 1 0 0 1 1 0 1 1 1 0 1 1 0 0 0 0
##  [704] 1 1 0 1 0 0 1 0 0 1 1 1 1 1 0 1 1 1 1 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0
##  [741] 1 0 1 0 1 1 0 0 0 1 1 1 0 0 1 0 1 0 1 1 0 0 1 1 1 0 1 1 0 1 0 1 0 0 0 1 0
##  [778] 1 0 0 1 1 0 0 1 0 0 1 1 1 0 0 0 1 1 0 1 1 1 0 1 1 1 1 0 0 1 0 1 0 0 1 0 0
##  [815] 0 0 1 1 0 0 1 0 0 0 1 0 0 1 1 1 1 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 1 0 1 1
##  [852] 0 0 1 1 0 1 0 1 0 0 0 0 0 0 1 1 0 1 1 1 1 1 0 1 0 0 0 1 1 0 1 0 0 1 1 0 0
##  [889] 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 1 1 0 1 0 1 0 1 0 0 1 1 0 0 0 1 0 1 1 0 0 1
##  [926] 1 1 1 0 0 1 0 0 1 0 0 0 1 1 0 0 0 0 1 0 1 0 0 1 1 1 0 0 0 1 0 0 0 1 1 0 1
##  [963] 0 0 1 1 1 1 0 0 1 0 0 1 0 1 1 0 0 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1 0
## [1000] 1 0 1 0 0 0 0 0 0 1 1 0 1 0 1 1 1 1 0 1 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1
## [1037] 1 1 0 0 0 0 1 1 0 1 1 0 1 1 1 0 0 1 1 1 1 1 0 0 1 0 0 1 1 0 1 0 1 1 1 1 1
## [1074] 0 0 1 0 1 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 1 0 0 1 1 1 0 1 0 1 0 1 1 0
## [1111] 0 0 0 1 0 0 1 1 1 1 1 1 0 1 0 0 0 0 1 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0
## [1148] 0 1 1 0 0 0 1 0 1 0 0 1 0 0 1 0 1 0 1 1 0 1 0 1 0 0 0 1 1 1 0 1 1 0 1 0 0
## [1185] 0 0 0 1 1 1 0 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 0 1 0 1 0 0 1 0 0 0 1 1 1
## [1222] 0 0 0 1 0 1 1 1 0 0 1 1 0 1 1 0 0 0 1 1 0 1 1 0 0 0 1 0 1 1 1 0 0 1 1 1 1
## [1259] 1 1 0 0 1 1 0 0 1 1 1 1 1 1 0 1 0 1 0 0 1
train_forest
## # A tibble: 1,279 x 14
##    fixed_acidity volatile_acidity citric_acid density alcohol    pH
##            <dbl>            <dbl>       <dbl>   <dbl>   <dbl> <dbl>
##  1           7.1            0.48         0.28   0.997    10.3  3.24
##  2           7.6            0.49         0.33   0.997     9    3.3 
##  3           5              1.02         0.04   0.994    10.5  3.75
##  4           7.6            0.43         0.29   0.997     9.5  3.4 
##  5           6.8            0.59         0.1    0.996     9.7  3.3 
##  6           6.8            0.815        0      0.995     9.8  3.3 
##  7           8.5            0.21         0.52   0.996    10.4  3.36
##  8           7.4            0.36         0.29   0.996    11    3.3 
##  9           5.5            0.49         0.03   0.991    14    3.3 
## 10           6.8            0.49         0.22   0.994    11.3  3.41
## # ... with 1,269 more rows, and 8 more variables: Log_residual_sugar <dbl>,
## #   Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## #   Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>,
## #   y_pred_probs2 <dbl>, y_pred2 <dbl>

We produce the confusion matrix and the evaluation metrics for the final Random Forest model:

cm_train_forest <- confusionMatrix(as.factor(train_forest$y_pred2),
    as.factor(train_forest$nota_vino), positive = "1")
cm_train_forest$table
##           Reference
## Prediction   0   1
##          0 492 126
##          1 105 556
# result
cm_train_forest$overall["Accuracy"] %>%
    round(2)
## Accuracy 
##     0.82
cm_train_forest$byClass["Recall"] %>%
    round(2)
## Recall 
##   0.82
cm_train_forest$byClass["Precision"] %>%
    round(2)
## Precision 
##      0.84

We produce the ROC curve for the final Random Forest model:

roc_rf <- plot.roc(as.numeric(train_forest$nota_vino), as.numeric(train_forest$y_pred_probs2))

auc(roc_rf)
## Area under the curve: 0.901

We obtain an area under the curve of roughly 90% on the training data.

Checking the model on the test data:

We now validate the predictive power of our Random Forest model on the test dataset. To do so, first of all we recreate the binary variable “nota_vino” on the test data.

test_forest <- test %>%
    mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
    mutate(quality = NULL)
test_forest
## # A tibble: 320 x 12
##    fixed_acidity volatile_acidity citric_acid density alcohol    pH
##            <dbl>            <dbl>       <dbl>   <dbl>   <dbl> <dbl>
##  1           7.4             0.7         0      0.998     9.4  3.51
##  2           7.3             0.65        0      0.995    10    3.39
##  3           8.9             0.22        0.48   0.997     9.4  3.39
##  4           7.6             0.41        0.24   0.996     9.5  3.28
##  5           7.1             0.71        0      0.997     9.4  3.47
##  6           5.7             1.13        0.09   0.994     9.8  3.5 
##  7           7.3             0.45        0.36   0.998    10.5  3.33
##  8           8.1             0.66        0.22   0.997    10.3  3.3 
##  9           6.8             0.67        0.02   0.996     9.5  3.48
## 10           5.6             0.31        0.37   0.995     9.2  3.32
## # ... with 310 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## #   Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## #   Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(test_forest$nota_vino)
## 
##   0   1 
## 147 173
str(test_forest)
## tibble [320 x 12] (S3: tbl_df/tbl/data.frame)
##  $ fixed_acidity           : num [1:320] 7.4 7.3 8.9 7.6 7.1 5.7 7.3 8.1 6.8 5.6 ...
##  $ volatile_acidity        : num [1:320] 0.7 0.65 0.22 0.41 0.71 1.13 0.45 0.66 0.67 0.31 ...
##  $ citric_acid             : num [1:320] 0 0 0.48 0.24 0 0.09 0.36 0.22 0.02 0.37 ...
##  $ density                 : num [1:320] 0.998 0.995 0.997 0.996 0.997 ...
##  $ alcohol                 : num [1:320] 9.4 10 9.4 9.5 9.4 9.8 10.5 10.3 9.5 9.2 ...
##  $ pH                      : num [1:320] 3.51 3.39 3.39 3.28 3.47 3.5 3.33 3.3 3.48 3.32 ...
##  $ Log_residual_sugar      : num [1:320] 0.642 0.182 0.588 0.588 0.642 ...
##  $ Log_chlorides           : num [1:320] -2.58 -2.73 -2.56 -2.53 -2.53 ...
##  $ Log_free_sulfur_dioxide : num [1:320] 2.4 2.71 3.37 1.39 2.64 ...
##  $ Log_total_sulfur_dioxide: num [1:320] 3.53 3.04 4.09 2.4 3.56 ...
##  $ Log_sulphates           : num [1:320] -0.58 -0.755 -0.635 -0.528 -0.598 ...
##  $ nota_vino               : num [1:320] 0 1 1 0 0 0 0 0 0 0 ...
test_forest$y_pred_probs2 <- predict(caret.rf, newdata = test_forest,
    type = "prob")
test_forest$y_pred_probs2 <- ifelse(test_forest$y_pred_probs2$`1` >
    0.5, test_forest$y_pred_probs2$`1`, 1 - test_forest$y_pred_probs2$`0`)

test_forest$y_pred2 <- ifelse(test_forest$y_pred_probs2 > 0.5,
    1, 0)

# test_forest$y_pred_probs2
test_forest$y_pred2
##   [1] 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0
##  [38] 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 1 1 1 0 1 0 0 1 0 0 0 1 0 1 0 1 1 0 0 0 0 0
##  [75] 0 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 0 1 0 0 0 1 1 1 1 1 0 1 1 1 1 1 0 0
## [112] 0 1 0 1 1 1 1 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 1
## [149] 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 1 1 1 1 1 0 0 1 1 0 0 0 1 1 1 1 1 1 1 1 1
## [186] 1 0 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 1 1 0 0 0 0 1 1 0 0 0 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 1 1 0 1 1 1 1 0 1 0 0 1 0 1 0 0 0
## [260] 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 1 0 1 1 1 0 1 0 0 1 1 1 1 0 0 1 0 0 1 0 0
## [297] 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 1 1 0 1 0 1
test_forest
## # A tibble: 320 x 14
##    fixed_acidity volatile_acidity citric_acid density alcohol    pH
##            <dbl>            <dbl>       <dbl>   <dbl>   <dbl> <dbl>
##  1           7.4             0.7         0      0.998     9.4  3.51
##  2           7.3             0.65        0      0.995    10    3.39
##  3           8.9             0.22        0.48   0.997     9.4  3.39
##  4           7.6             0.41        0.24   0.996     9.5  3.28
##  5           7.1             0.71        0      0.997     9.4  3.47
##  6           5.7             1.13        0.09   0.994     9.8  3.5 
##  7           7.3             0.45        0.36   0.998    10.5  3.33
##  8           8.1             0.66        0.22   0.997    10.3  3.3 
##  9           6.8             0.67        0.02   0.996     9.5  3.48
## 10           5.6             0.31        0.37   0.995     9.2  3.32
## # ... with 310 more rows, and 8 more variables: Log_residual_sugar <dbl>,
## #   Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## #   Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>,
## #   y_pred_probs2 <dbl>, y_pred2 <dbl>

We produce the confusion matrix and the evaluation metrics for the final Random Forest model on the test data:

cm_test_forest <- confusionMatrix(as.factor(test_forest$y_pred2),
    as.factor(test_forest$nota_vino), positive = "1")
cm_test_forest$table
##           Reference
## Prediction   0   1
##          0 110  55
##          1  37 118
# result
cm_test_forest$overall["Accuracy"] %>%
    round(2)
## Accuracy 
##     0.71
cm_test_forest$byClass["Recall"] %>%
    round(2)
## Recall 
##   0.68
cm_test_forest$byClass["Precision"] %>%
    round(2)
## Precision 
##      0.76
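
On the test data the accuracy falls to roughly 71% (versus 82% on the training data), so the final Random Forest model still shows a degree of overfitting.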

10.5. ENSEMBLE METHODS

10.5.1. ADABoost - Boosted Classification Tree

https://rubenfcasal.github.io/aprendizaje_estadistico/boosting-en-r.html

In our training dataset we create the binary variable “nota_vino”, which, based on “quality”, flags wines with passing scores (quality >= 6) or failing scores (quality < 6).

train_en <- train[, colnames(train) != "quality"]
train_en$nota_vino <- factor(train$quality < 6, labels = c("aprobado",
    "suspenso"))  # levels = c('FALSE', 'TRUE')
str(train_en)
## tibble [1,279 x 12] (S3: tbl_df/tbl/data.frame)
##  $ fixed_acidity           : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
##  $ volatile_acidity        : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
##  $ citric_acid             : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
##  $ density                 : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
##  $ alcohol                 : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
##  $ pH                      : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
##  $ Log_residual_sugar      : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
##  $ Log_chlorides           : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
##  $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
##  $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
##  $ Log_sulphates           : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
##  $ nota_vino               : Factor w/ 2 levels "aprobado","suspenso": 2 2 2 2 2 2 2 2 1 1 ...

We build the boosting model with an initial parameter configuration:

ada.boost <- ada(nota_vino ~ ., data = train_en, type = "real",
             control = rpart.control(maxdepth = 2, cp = 0, minsplit = 10, xval = 0),
             iter = 150, nu = 0.05)
ada.boost
## Call:
## ada(nota_vino ~ ., data = train_en, type = "real", control = rpart.control(maxdepth = 2, 
##     cp = 0, minsplit = 10, xval = 0), iter = 150, nu = 0.05)
## 
## Loss: exponential Method: real   Iteration: 150 
## 
## Final Confusion Matrix for Data:
##           Final Prediction
## True value aprobado suspenso
##   aprobado      531      151
##   suspenso      120      477
## 
## Train Error: 0.212 
## 
## Out-Of-Bag Error:  0.22  iteration= 147 
## 
## Additional Estimates of number of iterations:
## 
## train.err1 train.kap1 
##        144        144

We can see how the error decreases as the number of iterations in the model increases:

plot(ada.boost)

We evaluate the model's accuracy on the test sample:

test_en <- test[, colnames(test) != "quality"]
test_en$nota_vino <- factor(test$quality < 6, labels = c("aprobado",
    "suspenso"))  # levels = c('FALSE', 'TRUE')
str(test_en)
## tibble [320 x 12] (S3: tbl_df/tbl/data.frame)
##  $ fixed_acidity           : num [1:320] 7.4 7.3 8.9 7.6 7.1 5.7 7.3 8.1 6.8 5.6 ...
##  $ volatile_acidity        : num [1:320] 0.7 0.65 0.22 0.41 0.71 1.13 0.45 0.66 0.67 0.31 ...
##  $ citric_acid             : num [1:320] 0 0 0.48 0.24 0 0.09 0.36 0.22 0.02 0.37 ...
##  $ density                 : num [1:320] 0.998 0.995 0.997 0.996 0.997 ...
##  $ alcohol                 : num [1:320] 9.4 10 9.4 9.5 9.4 9.8 10.5 10.3 9.5 9.2 ...
##  $ pH                      : num [1:320] 3.51 3.39 3.39 3.28 3.47 3.5 3.33 3.3 3.48 3.32 ...
##  $ Log_residual_sugar      : num [1:320] 0.642 0.182 0.588 0.588 0.642 ...
##  $ Log_chlorides           : num [1:320] -2.58 -2.73 -2.56 -2.53 -2.53 ...
##  $ Log_free_sulfur_dioxide : num [1:320] 2.4 2.71 3.37 1.39 2.64 ...
##  $ Log_total_sulfur_dioxide: num [1:320] 3.53 3.04 4.09 2.4 3.56 ...
##  $ Log_sulphates           : num [1:320] -0.58 -0.755 -0.635 -0.528 -0.598 ...
##  $ nota_vino               : Factor w/ 2 levels "aprobado","suspenso": 2 1 1 2 2 2 2 2 2 2 ...
set.seed(123)
pred_ada <- predict(ada.boost, newdata = test_en)
caret::confusionMatrix(pred_ada, test_en$nota_vino, positive = "aprobado")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction aprobado suspenso
##   aprobado      118       40
##   suspenso       55      107
##                                           
##                Accuracy : 0.7031          
##                  95% CI : (0.6498, 0.7527)
##     No Information Rate : 0.5406          
##     P-Value [Acc > NIR] : 2.039e-09       
##                                           
##                   Kappa : 0.4069          
##                                           
##  Mcnemar's Test P-Value : 0.1509          
##                                           
##             Sensitivity : 0.6821          
##             Specificity : 0.7279          
##          Pos Pred Value : 0.7468          
##          Neg Pred Value : 0.6605          
##              Prevalence : 0.5406          
##          Detection Rate : 0.3688          
##    Detection Prevalence : 0.4938          
##       Balanced Accuracy : 0.7050          
##                                           
##        'Positive' Class : aprobado        
## 

With this parameter configuration, the ada boosting model achieves an accuracy of 70.31% on this classification task.

To optimize the results of this model, we can tune its hyperparameters:

modelLookup("ada")
##   model parameter          label forReg forClass probModel
## 1   ada      iter         #Trees  FALSE     TRUE      TRUE
## 2   ada  maxdepth Max Tree Depth  FALSE     TRUE      TRUE
## 3   ada        nu  Learning Rate  FALSE     TRUE      TRUE

We see the “iter”, “maxdepth” and “nu” parameters available in the ada boosting model for classification trees.
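
The train() call below searches caret's built-in default grid; as a small sketch (not executed in the original report), an explicit grid could instead be supplied through tuneGrid:

# hypothetical explicit grid for method = "ada" (sketch only)
ada_grid <- expand.grid(iter = c(50, 100, 150), maxdepth = 1:3,
    nu = c(0.05, 0.1))
# train(nota_vino ~ ., method = "ada", data = train_en,
#     tuneGrid = ada_grid, trControl = trainControl(method = "cv", number = 5))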

set.seed(123)
caret.ada0 <- train(nota_vino ~ ., method = "ada", data = train_en,
                   trControl = trainControl(method = "cv", number = 5))
caret.ada0
## Boosted Classification Trees 
## 
## 1279 samples
##   11 predictor
##    2 classes: 'aprobado', 'suspenso' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1024, 1024, 1023, 1023, 1022 
## Resampling results across tuning parameters:
## 
##   maxdepth  iter  Accuracy   Kappa    
##   1          50   0.7397072  0.4808586
##   1         100   0.7600076  0.5185542
##   1         150   0.7623483  0.5226994
##   2          50   0.7514108  0.5003883
##   2         100   0.7654825  0.5288044
##   2         150   0.7592111  0.5162555
##   3          50   0.7584145  0.5153590
##   3         100   0.7583932  0.5153576
##   3         150   0.7544839  0.5072981
## 
## Tuning parameter 'nu' was held constant at a value of 0.1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were iter = 100, maxdepth = 2 and nu = 0.1.

We obtain the optimal hyperparameter configuration for the model: “iter” = 100, “maxdepth” = 2 and “nu” = 0.1.

confusionMatrix(predict(caret.ada0, newdata = test_en), test_en$nota_vino, positive = "aprobado")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction aprobado suspenso
##   aprobado      120       42
##   suspenso       53      105
##                                           
##                Accuracy : 0.7031          
##                  95% CI : (0.6498, 0.7527)
##     No Information Rate : 0.5406          
##     P-Value [Acc > NIR] : 2.039e-09       
##                                           
##                   Kappa : 0.4056          
##                                           
##  Mcnemar's Test P-Value : 0.3049          
##                                           
##             Sensitivity : 0.6936          
##             Specificity : 0.7143          
##          Pos Pred Value : 0.7407          
##          Neg Pred Value : 0.6646          
##              Prevalence : 0.5406          
##          Detection Rate : 0.3750          
##    Detection Prevalence : 0.5062          
##       Balanced Accuracy : 0.7040          
##                                           
##        'Positive' Class : aprobado        
## 

With the base model we obtain an accuracy of 70.31% on the test data.

We now refit the base model with the optimal hyperparameter configuration:

set.seed(123)
caret.ada1 <- train(nota_vino ~ ., method = "ada", data = train_en,
                    tuneGrid = data.frame(iter =  100, maxdepth = 2, nu = c(0.1)),
                    trControl = trainControl(method = "cv", number = 5))
caret.ada1
## Boosted Classification Trees 
## 
## 1279 samples
##   11 predictor
##    2 classes: 'aprobado', 'suspenso' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1024, 1024, 1023, 1023, 1022 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7631112  0.5249703
## 
## Tuning parameter 'iter' was held constant at a value of 100
## Tuning
##  parameter 'maxdepth' was held constant at a value of 2
## Tuning parameter
##  'nu' was held constant at a value of 0.1

We achieve an accuracy of 76.31% on train.

confusionMatrix(predict(caret.ada1, newdata = test_en), test_en$nota_vino, positive = "aprobado")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction aprobado suspenso
##   aprobado      123       40
##   suspenso       50      107
##                                           
##                Accuracy : 0.7188          
##                  95% CI : (0.6661, 0.7673)
##     No Information Rate : 0.5406          
##     P-Value [Acc > NIR] : 4.863e-11       
##                                           
##                   Kappa : 0.4366          
##                                           
##  Mcnemar's Test P-Value : 0.3428          
##                                           
##             Sensitivity : 0.7110          
##             Specificity : 0.7279          
##          Pos Pred Value : 0.7546          
##          Neg Pred Value : 0.6815          
##              Prevalence : 0.5406          
##          Detection Rate : 0.3844          
##    Detection Prevalence : 0.5094          
##       Balanced Accuracy : 0.7194          
##                                           
##        'Positive' Class : aprobado        
## 

We achieve an accuracy of 71.88% on test.

10.5.2. XGBoost - Extreme Gradient Boosting

In our training dataset we create the binary variable “nota_vino”, which, based on “quality”, flags wines with passing scores (quality >= 6) or failing scores (quality < 6).

train_xgb <- train[, colnames(train) != "quality"]
train_xgb$nota_vino <- factor(train$quality < 6, labels = c("aprobado",
    "suspenso"))  # levels = c('FALSE', 'TRUE')
str(train_xgb)
## tibble [1,279 x 12] (S3: tbl_df/tbl/data.frame)
##  $ fixed_acidity           : num [1:1279] 7.1 7.6 5 7.6 6.8 6.8 8.5 7.4 5.5 6.8 ...
##  $ volatile_acidity        : num [1:1279] 0.48 0.49 1.02 0.43 0.59 0.815 0.21 0.36 0.49 0.49 ...
##  $ citric_acid             : num [1:1279] 0.28 0.33 0.04 0.29 0.1 0 0.52 0.29 0.03 0.22 ...
##  $ density                 : num [1:1279] 0.997 0.997 0.994 0.997 0.996 ...
##  $ alcohol                 : num [1:1279] 10.3 9 10.5 9.5 9.7 9.8 10.4 11 14 11.3 ...
##  $ pH                      : num [1:1279] 3.24 3.3 3.75 3.4 3.3 3.3 3.36 3.3 3.3 3.41 ...
##  $ Log_residual_sugar      : num [1:1279] 1.03 0.642 0.336 0.742 0.531 ...
##  $ Log_chlorides           : num [1:1279] -2.69 -2.6 -3.1 -2.59 -2.76 ...
##  $ Log_free_sulfur_dioxide : num [1:1279] 1.79 3.3 3.71 2.94 3.53 ...
##  $ Log_total_sulfur_dioxide: num [1:1279] 2.77 4.44 4.44 4.19 3.97 ...
##  $ Log_sulphates           : num [1:1279] -0.635 -0.545 -0.478 -0.446 -0.4 ...
##  $ nota_vino               : Factor w/ 2 levels "aprobado","suspenso": 2 2 2 2 2 2 2 2 1 1 ...

To optimize the results of the model we are about to build, we can tune its hyperparameters:

modelLookup("xgbTree")
##     model        parameter                          label forReg forClass
## 1 xgbTree          nrounds          # Boosting Iterations   TRUE     TRUE
## 2 xgbTree        max_depth                 Max Tree Depth   TRUE     TRUE
## 3 xgbTree              eta                      Shrinkage   TRUE     TRUE
## 4 xgbTree            gamma         Minimum Loss Reduction   TRUE     TRUE
## 5 xgbTree colsample_bytree     Subsample Ratio of Columns   TRUE     TRUE
## 6 xgbTree min_child_weight Minimum Sum of Instance Weight   TRUE     TRUE
## 7 xgbTree        subsample           Subsample Percentage   TRUE     TRUE
##   probModel
## 1      TRUE
## 2      TRUE
## 3      TRUE
## 4      TRUE
## 5      TRUE
## 6      TRUE
## 7      TRUE
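
These seven hyperparameters can be searched jointly. As a small sketch (not executed in the original report), an explicit grid using the column names listed above could be passed through tuneGrid:

# hypothetical explicit grid for method = "xgbTree" (sketch only)
xgb_grid <- expand.grid(nrounds = c(50, 100, 150), max_depth = c(2, 3),
    eta = c(0.1, 0.3), gamma = 0, colsample_bytree = 0.8,
    min_child_weight = 1, subsample = 0.8)
# train(nota_vino ~ ., method = "xgbTree", data = train_xgb,
#     tuneGrid = xgb_grid, trControl = trainControl(method = "cv", number = 5))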

We build the boosting model with an initial default parameter configuration:

set.seed(2)
caret.xgb <- train(nota_vino ~ ., method = "xgbTree", data = train_xgb,
    trControl = trainControl(method = "cv", number = 5))
## [15:03:52] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## (the same xgboost deprecation warning is repeated for every resampling fit; the remaining repetitions are omitted)
## [15:03:58] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:03:58] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:03:59] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:03:59] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:03:59] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:03:59] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:03:59] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:03:59] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:03:59] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:03:59] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:03:59] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:03:59] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:03:59] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:03:59] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:00] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:00] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:00] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:00] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:00] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:00] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:00] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:00] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:00] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:00] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:01] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:01] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:01] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:01] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:01] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:01] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:01] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:01] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:01] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:01] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:01] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:01] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:01] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:01] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:02] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:02] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:02] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:02] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:02] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:02] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:02] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:02] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:02] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:02] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:02] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:02] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:03] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:03] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:03] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:03] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:03] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:03] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:03] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:03] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:03] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:03] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:04] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:04] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:04] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:04] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:04] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:04] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:04] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:04] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:04] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:04] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:04] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:04] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:04] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:04] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:04] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:04] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:05] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:05] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:05] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:05] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:05] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:05] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:05] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:05] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:05] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:05] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:06] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:06] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:06] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:06] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:06] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:06] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:06] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:06] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:06] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:06] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:07] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:07] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:07] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:07] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:07] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:07] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:07] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:07] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:07] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:07] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:07] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:07] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:07] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:07] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:07] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:07] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:08] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:09] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:09] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:09] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:09] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:09] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:09] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:09] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:09] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:09] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:09] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:09] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:09] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:10] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:10] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:10] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:10] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:10] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:10] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:10] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:10] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:10] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:10] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:10] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:10] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:10] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:10] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:11] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:11] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:11] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:11] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:11] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:11] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:11] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:11] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:11] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:11] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:12] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:12] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:12] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:12] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:12] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:12] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:12] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:12] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:12] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:12] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:12] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:12] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:13] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:13] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:13] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:13] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:13] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:13] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:13] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:13] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:13] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:13] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:13] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:13] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:13] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:13] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:14] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:14] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:14] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:14] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:14] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:14] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:14] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:14] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:14] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:14] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:15] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:15] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:15] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:15] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:15] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:15] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:15] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:15] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:15] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:15] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:15] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:15] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:16] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:16] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:16] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:16] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:16] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:16] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:16] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:16] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:16] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:16] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:16] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:16] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:16] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:16] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:16] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:16] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:17] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:17] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:17] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:17] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:17] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:17] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:17] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:17] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:17] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:17] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:18] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:18] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:18] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:18] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:18] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:18] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:18] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:18] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:18] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:18] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:18] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:18] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:19] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:19] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:19] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:19] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:19] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:19] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:19] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:19] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:19] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:19] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:19] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:19] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:19] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:19] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:20] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:20] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:20] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:20] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:20] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:20] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:20] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:20] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:20] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:20] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:21] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:21] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:21] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:21] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:21] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
## [15:04:21] WARNING: amalgamation/../src/c_api/c_api.cc:718: `ntree_limit` is deprecated, use `iteration_range` instead.
caret.xgb
## eXtreme Gradient Boosting 
## 
## 1279 samples
##   11 predictor
##    2 classes: 'aprobado', 'suspenso' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1023, 1023, 1024, 1023, 1023 
## Resampling results across tuning parameters:
## 
##   eta  max_depth  colsample_bytree  subsample  nrounds  Accuracy   Kappa    
##   0.3  1          0.6               0.50        50      0.7584252  0.5157983
##   0.3  1          0.6               0.50       100      0.7615594  0.5214823
##   0.3  1          0.6               0.50       150      0.7592126  0.5168812
##   0.3  1          0.6               0.75        50      0.7685784  0.5357000
##   0.3  1          0.6               0.75       100      0.7631158  0.5251201
##   0.3  1          0.6               0.75       150      0.7662439  0.5317973
##   0.3  1          0.6               1.00        50      0.7709252  0.5405230
##   0.3  1          0.6               1.00       100      0.7631189  0.5252044
##   0.3  1          0.6               1.00       150      0.7623346  0.5234812
##   0.3  1          0.8               0.50        50      0.7529473  0.5043059
##   0.3  1          0.8               0.50       100      0.7599847  0.5184122
##   0.3  1          0.8               0.50       150      0.7584038  0.5151712
##   0.3  1          0.8               0.75        50      0.7670221  0.5322787
##   0.3  1          0.8               0.75       100      0.7709314  0.5404779
##   0.3  1          0.8               0.75       150      0.7685876  0.5355394
##   0.3  1          0.8               1.00        50      0.7670129  0.5328119
##   0.3  1          0.8               1.00       100      0.7623346  0.5236110
##   0.3  1          0.8               1.00       150      0.7623346  0.5238734
##   0.3  2          0.6               0.50        50      0.7693566  0.5379810
##   0.3  2          0.6               0.50       100      0.7810784  0.5609420
##   0.3  2          0.6               0.50       150      0.7756189  0.5492785
##   0.3  2          0.6               0.75        50      0.7771814  0.5531543
##   0.3  2          0.6               0.75       100      0.7787561  0.5555964
##   0.3  2          0.6               0.75       150      0.7779687  0.5546161
##   0.3  2          0.6               1.00        50      0.7709252  0.5415508
##   0.3  2          0.6               1.00       100      0.7834375  0.5660762
##   0.3  2          0.6               1.00       150      0.7818781  0.5625740
##   0.3  2          0.8               0.50        50      0.7662347  0.5309808
##   0.3  2          0.8               0.50       100      0.7795313  0.5579521
##   0.3  2          0.8               0.50       150      0.7740472  0.5468629
##   0.3  2          0.8               0.75        50      0.7615319  0.5223741
##   0.3  2          0.8               0.75       100      0.7717218  0.5423465
##   0.3  2          0.8               0.75       150      0.7693750  0.5372122
##   0.3  2          0.8               1.00        50      0.7740472  0.5483062
##   0.3  2          0.8               1.00       100      0.7842279  0.5672228
##   0.3  2          0.8               1.00       150      0.7709314  0.5403959
##   0.3  3          0.6               0.50        50      0.7779779  0.5544845
##   0.3  3          0.6               0.50       100      0.7662500  0.5300660
##   0.3  3          0.6               0.50       150      0.7725000  0.5422163
##   0.3  3          0.6               0.75        50      0.7779687  0.5537440
##   0.3  3          0.6               0.75       100      0.7857874  0.5685593
##   0.3  3          0.6               0.75       150      0.7826532  0.5626896
##   0.3  3          0.6               1.00        50      0.7701409  0.5392758
##   0.3  3          0.6               1.00       100      0.7818842  0.5616341
##   0.3  3          0.6               1.00       150      0.7873529  0.5723541
##   0.3  3          0.8               0.50        50      0.7623529  0.5230456
##   0.3  3          0.8               0.50       100      0.7631189  0.5236251
##   0.3  3          0.8               0.50       150      0.7779657  0.5535856
##   0.3  3          0.8               0.75        50      0.7716881  0.5413281
##   0.3  3          0.8               0.75       100      0.7818689  0.5614994
##   0.3  3          0.8               0.75       150      0.7818627  0.5613169
##   0.3  3          0.8               1.00        50      0.7810999  0.5609327
##   0.3  3          0.8               1.00       100      0.7912561  0.5808283
##   0.3  3          0.8               1.00       150      0.7912561  0.5802007
##   0.4  1          0.6               0.50        50      0.7545129  0.5074292
##   0.4  1          0.6               0.50       100      0.7607690  0.5204900
##   0.4  1          0.6               0.50       150      0.7560662  0.5113351
##   0.4  1          0.6               0.75        50      0.7623162  0.5232939
##   0.4  1          0.6               0.75       100      0.7607567  0.5197020
##   0.4  1          0.6               0.75       150      0.7678002  0.5336260
##   0.4  1          0.6               1.00        50      0.7599724  0.5186811
##   0.4  1          0.6               1.00       100      0.7576409  0.5144002
##   0.4  1          0.6               1.00       150      0.7584283  0.5156999
##   0.4  1          0.8               0.50        50      0.7545129  0.5080772
##   0.4  1          0.8               0.50       100      0.7568597  0.5132168
##   0.4  1          0.8               0.50       150      0.7678033  0.5339583
##   0.4  1          0.8               0.75        50      0.7646752  0.5283858
##   0.4  1          0.8               0.75       100      0.7631189  0.5249673
##   0.4  1          0.8               0.75       150      0.7678186  0.5340579
##   0.4  1          0.8               1.00        50      0.7568474  0.5124567
##   0.4  1          0.8               1.00       100      0.7631097  0.5251788
##   0.4  1          0.8               1.00       150      0.7709252  0.5411619
##   0.4  2          0.6               0.50        50      0.7591912  0.5170492
##   0.4  2          0.6               0.50       100      0.7592034  0.5159729
##   0.4  2          0.6               0.50       150      0.7607659  0.5191407
##   0.4  2          0.6               0.75        50      0.7662439  0.5317023
##   0.4  2          0.6               0.75       100      0.7685907  0.5357065
##   0.4  2          0.6               0.75       150      0.7670313  0.5316144
##   0.4  2          0.6               1.00        50      0.7717065  0.5422725
##   0.4  2          0.6               1.00       100      0.7818689  0.5627269
##   0.4  2          0.6               1.00       150      0.7803002  0.5587702
##   0.4  2          0.8               0.50        50      0.7670221  0.5322487
##   0.4  2          0.8               0.50       100      0.7756281  0.5490931
##   0.4  2          0.8               0.50       150      0.7693811  0.5366440
##   0.4  2          0.8               0.75        50      0.7615380  0.5209873
##   0.4  2          0.8               0.75       100      0.7725031  0.5429529
##   0.4  2          0.8               0.75       150      0.7834467  0.5652049
##   0.4  2          0.8               1.00        50      0.7771875  0.5531451
##   0.4  2          0.8               1.00       100      0.7724969  0.5436643
##   0.4  2          0.8               1.00       150      0.7771906  0.5527097
##   0.4  3          0.6               0.50        50      0.7599939  0.5173947
##   0.4  3          0.6               0.50       100      0.7639185  0.5247570
##   0.4  3          0.6               0.50       150      0.7654534  0.5279526
##   0.4  3          0.6               0.75        50      0.7600092  0.5180450
##   0.4  3          0.6               0.75       100      0.7795435  0.5569284
##   0.4  3          0.6               0.75       150      0.7779718  0.5536798
##   0.4  3          0.6               1.00        50      0.7701562  0.5381242
##   0.4  3          0.6               1.00       100      0.7810938  0.5604603
##   0.4  3          0.6               1.00       150      0.7795404  0.5570991
##   0.4  3          0.8               0.50        50      0.7670190  0.5321376
##   0.4  3          0.8               0.50       100      0.7725061  0.5421105
##   0.4  3          0.8               0.50       150      0.7756281  0.5483835
##   0.4  3          0.8               0.75        50      0.7701562  0.5376402
##   0.4  3          0.8               0.75       100      0.7803033  0.5580894
##   0.4  3          0.8               0.75       150      0.7865686  0.5709177
##   0.4  3          0.8               1.00        50      0.7810999  0.5600757
##   0.4  3          0.8               1.00       100      0.7850184  0.5680617
##   0.4  3          0.8               1.00       150      0.7818811  0.5616134
## 
## Tuning parameter 'gamma' was held constant at a value of 0
## Tuning
##  parameter 'min_child_weight' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 100, max_depth = 3, eta
##  = 0.3, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1 and subsample
##  = 1.

To optimize the results of the model, we tune its hyperparameters using the values obtained:

caret.xgb$bestTune
##    nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 53     100         3 0.3     0              0.8                1         1

The optimal hyperparameter configuration for the model is nrounds = 100, max_depth = 3, eta = 0.3, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1 and subsample = 1.

We examine the relevance of each variable in the model:

varImp(caret.xgb)
## xgbTree variable importance
## 
##                          Overall
## alcohol                  100.000
## Log_sulphates             47.822
## volatile_acidity          44.375
## Log_total_sulfur_dioxide  31.256
## density                   29.242
## Log_chlorides             23.118
## pH                         9.363
## citric_acid                6.351
## fixed_acidity              3.858
## Log_free_sulfur_dioxide    1.028
## Log_residual_sugar         0.000
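
The importance ranking is easier to read as a chart. A minimal sketch using caret's plot method for varImp objects (assuming the caret.xgb object from the chunk above is still in the workspace):

# dot chart of the xgbTree importance scores, all 11 predictors
plot(varImp(caret.xgb), top = 11)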

We evaluate the base model on the test data:

test_xgb <- test[, colnames(test) != "quality"]
# factor levels: FALSE (quality >= 6) -> 'aprobado', TRUE (quality < 6) -> 'suspenso'
test_xgb$nota_vino <- factor(test$quality < 6, labels = c("aprobado",
    "suspenso"))
str(test_xgb)
## tibble [320 x 12] (S3: tbl_df/tbl/data.frame)
##  $ fixed_acidity           : num [1:320] 7.4 7.3 8.9 7.6 7.1 5.7 7.3 8.1 6.8 5.6 ...
##  $ volatile_acidity        : num [1:320] 0.7 0.65 0.22 0.41 0.71 1.13 0.45 0.66 0.67 0.31 ...
##  $ citric_acid             : num [1:320] 0 0 0.48 0.24 0 0.09 0.36 0.22 0.02 0.37 ...
##  $ density                 : num [1:320] 0.998 0.995 0.997 0.996 0.997 ...
##  $ alcohol                 : num [1:320] 9.4 10 9.4 9.5 9.4 9.8 10.5 10.3 9.5 9.2 ...
##  $ pH                      : num [1:320] 3.51 3.39 3.39 3.28 3.47 3.5 3.33 3.3 3.48 3.32 ...
##  $ Log_residual_sugar      : num [1:320] 0.642 0.182 0.588 0.588 0.642 ...
##  $ Log_chlorides           : num [1:320] -2.58 -2.73 -2.56 -2.53 -2.53 ...
##  $ Log_free_sulfur_dioxide : num [1:320] 2.4 2.71 3.37 1.39 2.64 ...
##  $ Log_total_sulfur_dioxide: num [1:320] 3.53 3.04 4.09 2.4 3.56 ...
##  $ Log_sulphates           : num [1:320] -0.58 -0.755 -0.635 -0.528 -0.598 ...
##  $ nota_vino               : Factor w/ 2 levels "aprobado","suspenso": 2 1 1 2 2 2 2 2 2 2 ...
confusionMatrix(predict(caret.xgb, newdata = test_xgb), test_xgb$nota_vino)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction aprobado suspenso
##   aprobado      117       35
##   suspenso       56      112
##                                           
##                Accuracy : 0.7156          
##                  95% CI : (0.6628, 0.7644)
##     No Information Rate : 0.5406          
##     P-Value [Acc > NIR] : 1.058e-10       
##                                           
##                   Kappa : 0.4336          
##                                           
##  Mcnemar's Test P-Value : 0.03603         
##                                           
##             Sensitivity : 0.6763          
##             Specificity : 0.7619          
##          Pos Pred Value : 0.7697          
##          Neg Pred Value : 0.6667          
##              Prevalence : 0.5406          
##          Detection Rate : 0.3656          
##    Detection Prevalence : 0.4750          
##       Balanced Accuracy : 0.7191          
##                                           
##        'Positive' Class : aprobado        
## 

The base model obtains a test accuracy of 71.56%.
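
Accuracy can be complemented with a threshold-free metric. A minimal sketch computing the test-set ROC curve and AUC with pROC (already loaded above; it assumes predict() with type = "prob" on the caret model, which the xgbTree method supports):

# class probabilities on the test set and ROC/AUC for the 'aprobado' class
probs_xgb <- predict(caret.xgb, newdata = test_xgb, type = "prob")
roc_xgb <- roc(response = test_xgb$nota_vino, predictor = probs_xgb$aprobado,
    levels = c("suspenso", "aprobado"))
auc(roc_xgb)
plot(roc_xgb)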

We now refit the base model with the optimal hyperparameter configuration obtained above:

set.seed(1)
caret.xgb1 <- train(nota_vino ~ ., method = "xgbTree", data = train_xgb,
    tuneGrid = data.frame(nrounds = 100, max_depth = 3, eta = 0.3,
        gamma = 0, colsample_bytree = 0.8, min_child_weight = 1,
        subsample = 1), trControl = trainControl(method = "cv",
        number = 5))
caret.xgb1
## eXtreme Gradient Boosting 
## 
## 1279 samples
##   11 predictor
##    2 classes: 'aprobado', 'suspenso' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 1024, 1024, 1023, 1023, 1022 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7763987  0.5515202
## 
## Tuning parameter 'nrounds' was held constant at a value of 100
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Tuning parameter 'subsample' was held constant at a value of 1

With the optimal parameter configuration we obtain a cross-validated accuracy on the training data of 77.64%.

confusionMatrix(predict(caret.xgb1, newdata = test_xgb), test_xgb$nota_vino)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction aprobado suspenso
##   aprobado      124       36
##   suspenso       49      111
##                                          
##                Accuracy : 0.7344         
##                  95% CI : (0.6824, 0.782)
##     No Information Rate : 0.5406         
##     P-Value [Acc > NIR] : 7.942e-13      
##                                          
##                   Kappa : 0.4688         
##                                          
##  Mcnemar's Test P-Value : 0.1931         
##                                          
##             Sensitivity : 0.7168         
##             Specificity : 0.7551         
##          Pos Pred Value : 0.7750         
##          Neg Pred Value : 0.6937         
##              Prevalence : 0.5406         
##          Detection Rate : 0.3875         
##    Detection Prevalence : 0.5000         
##       Balanced Accuracy : 0.7359         
##                                          
##        'Positive' Class : aprobado       
## 

With the optimal parameter configuration we obtain a test accuracy of 73.44%.

10.6. Linear SVM - Support Vector Machines with a Linear Kernel

We use the kernlab package to build our SVM algorithm and train the model with the train() function from the caret package.

10.6.1. Creating the binary response variable

In our train and test data sets we have created the binary variable “nota_vino” which, based on “quality”, labels wines as passed (“aprobado”, quality >= 6) or failed (“suspenso”, quality < 6).

train_svm <- train[, colnames(train) != "quality"]
# factor levels: FALSE (quality >= 6) -> 'aprobado', TRUE (quality < 6) -> 'suspenso'
train_svm$nota_vino <- factor(train$quality < 6, labels = c("aprobado", "suspenso"))
train_svm
## # A tibble: 1,279 x 12
##    fixed_acidity volatile_acidity citric_acid density alcohol    pH
##            <dbl>            <dbl>       <dbl>   <dbl>   <dbl> <dbl>
##  1           7.1            0.48         0.28   0.997    10.3  3.24
##  2           7.6            0.49         0.33   0.997     9    3.3 
##  3           5              1.02         0.04   0.994    10.5  3.75
##  4           7.6            0.43         0.29   0.997     9.5  3.4 
##  5           6.8            0.59         0.1    0.996     9.7  3.3 
##  6           6.8            0.815        0      0.995     9.8  3.3 
##  7           8.5            0.21         0.52   0.996    10.4  3.36
##  8           7.4            0.36         0.29   0.996    11    3.3 
##  9           5.5            0.49         0.03   0.991    14    3.3 
## 10           6.8            0.49         0.22   0.994    11.3  3.41
## # ... with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## #   Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## #   Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <fct>

10.6.2. Building the linear SVM model

We build a linear SVM model with all the predictors in our data set.

set.seed(13)

modelo_svmlineal <- train(nota_vino ~ ., method = "svmLinear", data = train_svm)
modelo_svmlineal$finalModel
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Linear (vanilla) kernel function. 
## 
## Number of Support Vectors : 731 
## 
## Objective Function Value : -722.4628 
## Training error : 0.244722

Through the finalModel object we can see which hyperparameters were used (cost C) and the training error. This error is the one the model makes when predicting the very observations it was trained on; in our case it is 24.47%.
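
As a sanity check, the same training error can be recomputed by hand by comparing the in-sample predictions with the observed labels. A minimal sketch (assuming modelo_svmlineal and train_svm from the chunks above):

# fraction of training observations misclassified by the fitted SVM
pred_train <- predict(modelo_svmlineal, newdata = train_svm)
mean(pred_train != train_svm$nota_vino)  # should be close to 0.2447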

10.6.3. Evaluating the linear SVM model with cross-validation

We fit and evaluate our model multiple times on different subsets built from the training data via cross-validation, obtaining an error estimate for each repetition. Fitting a model this many times is computationally expensive, so with caret we parallelize the process to speed it up.

# parallelization (doMC relies on forking, so it is only available on
# Unix-like systems; on Windows, doParallel is the usual alternative)
registerDoMC(cores = 4)

# number of folds and repeats for the cross-validation
particiones  <- 10
repeticiones <- 5

# resampling scheme
control_modelosvm_lineal <- trainControl(method = "repeatedcv", number = particiones,
    repeats = repeticiones, returnResamp = "all", verboseIter = FALSE,
    allowParallel = TRUE)

set.seed(342)
modelo_svmlineal <- train(nota_vino ~ ., data = train_svm,
                          method = "svmLinear",
                          metric = "Accuracy",
                          trControl = control_modelosvm_lineal)
modelo_svmlineal
## Support Vector Machines with Linear Kernel 
## 
## 1279 samples
##   11 predictor
##    2 classes: 'aprobado', 'suspenso' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 1151, 1151, 1150, 1151, 1152, 1151, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7513235  0.5032292
## 
## Tuning parameter 'C' was held constant at a value of 1

With this cross-validation of 10 folds and 5 repeats we have fitted our model 50 times. We can plot the accuracy obtained in each of these fits.

grafo1 <- ggplot(data = modelo_svmlineal$resample, aes(x = Accuracy)) +
    geom_density(alpha = 0.5, fill = "blue") +
    geom_vline(xintercept = mean(modelo_svmlineal$resample$Accuracy), linetype = "dashed") +
    theme_bw()
grafo2 <- ggplot(data = modelo_svmlineal$resample, aes(x = 1, y = Accuracy)) +
    geom_boxplot(outlier.shape = NA, alpha = 0.5, fill = "blue") +
    geom_jitter(width = 0.05) +
    labs(x = "") +
    theme_bw() +
    theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
final_plot_svm <- ggarrange(grafo1, grafo2)
final_plot_svm <- annotate_figure(final_plot_svm,
    top = text_grob("Accuracy obtenido en la validación", size = 15))
final_plot_svm

With cross-validation we reach an average accuracy of 0.7513; that is, our linear SVM model correctly predicts whether a wine is good or bad about 75% of the time.
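
Beyond the average, the spread of the 50 per-resample estimates tells us how stable the model is; a quick way to inspect it (same fitted object as above):

#distribution of the 50 resampled accuracy estimates
summary(modelo_svmlineal$resample$Accuracy)
sd(modelo_svmlineal$resample$Accuracy)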

10.6.4. Hyperparameter tuning for the linear SVM model

As we have seen, our svmLinear model has a hyperparameter called cost (C). We refit the model with different values of C and apply cross-validation again, in order to identify which of the submodels obtains the best result.

For the candidate values of C we chose to work with a grid search, in which the exact hyperparameter values are specified.

#parallelization

registerDoMC(cores = 4)

#hyperparameters and number of repetitions
particiones  <- 10
repeticiones <- 5

hiperparametros <- data.frame(C = c(0.001, 0.01, 0.1, 0.5, 1, 10))


#model
control_modelosvm_lineal <- trainControl(method = "repeatedcv",
                                         number = particiones,
                                         repeats = repeticiones,
                                         returnResamp = "all",
                                         verboseIter = FALSE,
                                         allowParallel = TRUE)


set.seed(342)
modelo_svmlineal <- train(nota_vino ~ ., data = train_svm,
                          method = "svmLinear",
                          tuneGrid = hiperparametros,
                          metric = "Accuracy",
                          trControl = control_modelosvm_lineal)
modelo_svmlineal
## Support Vector Machines with Linear Kernel 
## 
## 1279 samples
##   11 predictor
##    2 classes: 'aprobado', 'suspenso' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times) 
## Summary of sample sizes: 1151, 1151, 1150, 1151, 1152, 1151, ... 
## Resampling results across tuning parameters:
## 
##   C      Accuracy   Kappa    
##   1e-03  0.7305088  0.4638641
##   1e-02  0.7525735  0.5062830
##   1e-01  0.7499246  0.5006288
##   5e-01  0.7506985  0.5019843
##   1e+00  0.7513235  0.5032292
##   1e+01  0.7516373  0.5038068
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.01.

We observe that, of all the values of C tried, C = 0.01 returns the best results, with an accuracy of 0.7526.

The plot below shows the variation of the accuracy for each of the values of C.

ggplot(data = modelo_svmlineal$resample,
       aes(x = as.factor(C), y = Accuracy, color = as.factor(C))) +
    geom_boxplot(outlier.shape = NA, alpha = 0.6) +
    geom_jitter(width = 0.2, alpha = 0.6) +
    geom_hline(yintercept = 0.62, linetype = "dashed") +
    labs(x = "C") +
    theme_bw() +
    theme(legend.position = "none")

We can also look at how the models evolve across the hyperparameter values we chose.

ggplot(modelo_svmlineal, highlight = TRUE) +
    labs(title = "Evolución accuracy en función de C") +
    theme_bw()
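
caret also stores the hyperparameter combination selected during tuning; we can retrieve it directly from the fitted object:

#value of C chosen by the tuning procedure
modelo_svmlineal$bestTune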

11. Unsupervised learning

11.1. K-means

We drop the response variable “quality”:

train_kmeans <- train[, -6]
train_kmeans
## # A tibble: 1,279 x 11
##    fixed_acidity volatile_acidity citric_acid density alcohol    pH
##            <dbl>            <dbl>       <dbl>   <dbl>   <dbl> <dbl>
##  1           7.1            0.48         0.28   0.997    10.3  3.24
##  2           7.6            0.49         0.33   0.997     9    3.3 
##  3           5              1.02         0.04   0.994    10.5  3.75
##  4           7.6            0.43         0.29   0.997     9.5  3.4 
##  5           6.8            0.59         0.1    0.996     9.7  3.3 
##  6           6.8            0.815        0      0.995     9.8  3.3 
##  7           8.5            0.21         0.52   0.996    10.4  3.36
##  8           7.4            0.36         0.29   0.996    11    3.3 
##  9           5.5            0.49         0.03   0.991    14    3.3 
## 10           6.8            0.49         0.22   0.994    11.3  3.41
## # ... with 1,269 more rows, and 5 more variables: Log_residual_sugar <dbl>,
## #   Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## #   Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>
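
Note that K-means is based on Euclidean distances, so variables with larger numeric ranges dominate the clustering. A common variant, not applied in this report, is to standardize the columns first; a minimal sketch:

#standardize each column to mean 0 and standard deviation 1 before clustering
train_kmeans_scaled <- as.data.frame(scale(train_kmeans))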

We look for the optimal number of clusters for our model:

fviz_nbclust(train_kmeans, kmeans, method = "wss")

Another way of searching for the optimum:

#calculate gap statistic based on number of clusters
gap_stat <- clusGap(train_kmeans,
                    FUN = kmeans,
                    nstart = 25,
                    K.max = 10,
                    B = 50)

#plot number of clusters vs. gap statistic
fviz_gap_stat(gap_stat)

We run K-means clustering with the optimal number of clusters K:

#make this example reproducible
set.seed(666)

km <- kmeans(train_kmeans, centers = 3, nstart = 25)

#view results
km
## K-means clustering with 3 clusters of sizes 615, 352, 312
## 
## Cluster means:
##   fixed_acidity volatile_acidity citric_acid   density   alcohol       pH
## 1      7.798862        0.5699187   0.2045041 0.9969913  9.696016 3.316846
## 2      7.110227        0.5246023   0.2134659 0.9948793 11.558286 3.398239
## 3     10.864103        0.4420192   0.4761218 0.9984547 10.610791 3.189583
##   Log_residual_sugar Log_chlorides Log_free_sulfur_dioxide
## 1          0.8186424     -2.434321                2.701835
## 2          0.8225008     -2.642422                2.513108
## 3          0.9650535     -2.460537                2.270269
##   Log_total_sulfur_dioxide Log_sulphates
## 1                 3.861476    -0.4820557
## 2                 3.374945    -0.4497846
## 3                 3.353542    -0.3855143
## 
## Clustering vector:
##    [1] 2 1 2 1 1 1 1 2 2 2 2 3 1 2 3 1 3 2 2 2 3 2 1 1 2 3 1 3 2 3 1 1 2 2 1 3 1
##   [38] 1 3 3 2 1 2 3 2 2 3 1 1 1 1 1 3 2 2 3 2 1 3 1 1 2 1 3 3 1 2 1 3 2 1 3 1 1
##   [75] 1 3 1 2 2 1 3 2 3 1 1 1 1 1 1 2 1 2 2 2 1 1 1 3 3 3 1 3 1 2 2 1 1 1 1 3 2
##  [112] 1 1 1 1 1 3 1 3 1 1 2 1 1 2 3 2 1 2 1 2 1 2 2 1 2 1 2 2 3 1 2 2 1 3 2 1 1
##  [149] 3 3 1 1 2 2 1 1 3 2 2 3 3 1 3 1 2 1 1 2 1 1 1 2 1 1 1 3 1 2 3 2 2 1 1 2 3
##  [186] 3 3 1 3 1 1 3 1 2 1 2 2 3 2 2 1 2 1 2 3 2 2 2 1 1 1 2 3 3 2 3 2 3 1 3 2 1
##  [223] 1 1 3 2 1 1 2 1 3 2 1 1 2 3 3 3 2 1 1 2 1 2 2 1 1 3 3 1 2 2 3 3 1 2 1 2 3
##  [260] 1 1 2 2 1 1 1 1 1 2 2 1 1 1 2 3 3 1 1 3 3 2 3 2 2 3 2 1 1 1 1 2 1 1 3 2 2
##  [297] 2 2 3 1 3 1 1 3 3 3 2 1 1 2 1 2 1 1 3 3 1 1 2 1 2 1 3 2 2 3 3 2 1 3 3 1 1
##  [334] 1 1 3 3 3 3 1 1 1 3 1 3 2 1 2 1 1 1 1 3 1 2 1 2 1 1 1 2 2 3 1 2 1 1 1 3 1
##  [371] 1 1 3 1 2 1 1 1 2 1 1 3 3 1 2 3 1 1 1 3 3 1 1 2 1 2 1 2 1 1 3 2 3 2 3 1 2
##  [408] 2 2 1 1 1 1 1 1 3 1 1 2 1 1 3 3 1 2 1 3 3 1 1 3 1 3 3 2 3 1 2 3 1 2 1 3 2
##  [445] 2 3 3 1 2 2 2 3 1 2 2 3 1 1 2 1 3 1 1 2 3 1 3 2 1 1 2 1 3 2 2 3 1 2 1 2 3
##  [482] 1 1 3 3 2 3 2 3 1 1 3 1 1 3 1 1 3 1 3 2 2 1 1 1 3 1 1 1 1 3 2 3 3 3 3 1 3
##  [519] 1 2 1 3 1 1 1 2 2 1 3 1 2 3 1 2 3 2 2 1 1 2 1 2 2 2 2 2 2 2 2 1 3 3 2 3 2
##  [556] 1 1 3 2 1 1 1 1 2 2 3 1 2 1 3 3 3 1 1 1 1 2 1 2 3 1 1 2 1 1 1 2 1 1 1 1 3
##  [593] 2 3 2 3 3 1 1 1 3 1 3 2 2 2 2 1 3 1 1 1 2 3 1 1 3 2 2 1 3 2 1 1 3 1 1 1 3
##  [630] 2 3 1 3 3 3 2 1 3 1 3 1 1 1 3 1 2 1 1 2 1 1 2 1 1 3 3 1 3 1 1 1 2 3 3 1 3
##  [667] 1 1 1 3 2 1 1 1 2 1 2 1 3 3 1 1 1 1 1 2 1 2 1 1 1 2 2 3 1 2 3 2 2 1 1 1 1
##  [704] 3 3 1 2 1 1 3 1 1 1 2 1 2 3 2 3 1 2 3 1 1 2 1 1 1 1 1 1 1 2 1 1 1 3 1 3 1
##  [741] 3 1 2 1 3 2 1 2 1 3 2 2 1 1 3 1 2 2 3 2 1 1 2 1 2 3 2 1 1 3 1 3 3 1 3 3 2
##  [778] 3 1 1 2 3 1 1 2 1 1 1 1 3 3 1 1 3 2 1 1 3 3 1 3 2 2 2 1 2 1 1 2 3 1 3 1 1
##  [815] 2 1 1 2 1 1 1 3 2 1 2 1 1 3 2 1 2 3 1 1 1 3 3 1 2 1 2 1 3 3 1 1 3 2 1 2 2
##  [852] 1 1 3 3 3 2 1 1 1 1 3 1 1 2 1 1 2 1 2 2 3 2 1 3 2 1 2 2 2 1 2 1 1 2 2 1 1
##  [889] 1 1 2 3 2 3 3 1 1 3 1 2 3 1 1 1 2 1 2 1 3 1 2 1 1 2 1 1 3 2 2 2 1 2 1 1 2
##  [926] 3 3 3 1 3 1 1 1 2 1 1 1 3 3 1 1 1 1 2 1 1 1 3 3 3 2 1 1 1 3 1 1 3 2 2 1 3
##  [963] 1 1 3 2 2 1 3 1 3 1 1 1 1 3 3 1 1 1 1 1 3 1 2 1 1 2 2 3 1 2 1 2 1 1 3 2 1
## [1000] 3 1 2 1 1 1 2 3 1 2 3 1 1 1 3 3 3 1 1 1 1 2 1 2 3 1 3 1 1 3 1 1 1 1 1 1 3
## [1037] 3 2 3 1 2 1 3 1 1 3 1 1 1 3 2 1 1 3 2 1 1 2 3 1 2 1 1 1 3 1 3 1 3 2 2 1 2
## [1074] 1 1 3 2 2 1 2 3 1 1 3 1 1 1 1 3 2 1 2 1 2 1 2 2 1 1 1 1 1 2 2 1 2 3 3 2 3
## [1111] 2 1 1 2 1 1 3 2 2 3 3 3 1 2 1 1 3 1 3 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
## [1148] 1 1 3 1 3 1 2 1 3 1 1 2 1 2 3 1 2 1 2 3 3 1 3 1 1 2 1 2 2 1 3 3 3 1 2 1 1
## [1185] 1 1 1 3 2 1 2 3 2 1 1 1 1 1 1 3 2 2 2 2 1 1 1 3 1 2 1 2 1 1 2 1 1 1 3 3 2
## [1222] 2 1 2 2 1 2 2 3 1 3 1 1 1 2 2 1 1 1 2 2 1 2 2 1 1 1 2 2 2 1 2 1 1 3 3 2 2
## [1259] 3 3 1 3 3 1 3 1 2 3 3 2 2 3 3 1 3 2 3 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 1246.609 1065.266 1221.854
##  (between_SS / total_SS =  50.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

We plot the results obtained:

fviz_cluster(km, data = train_kmeans, geom = "point")

We look at the mean values of each variable within the different clusters, and at the cluster sizes:

aggregate(train_kmeans, by=list(cluster=km$cluster), mean)
##   cluster fixed_acidity volatile_acidity citric_acid   density   alcohol
## 1       1      7.798862        0.5699187   0.2045041 0.9969913  9.696016
## 2       2      7.110227        0.5246023   0.2134659 0.9948793 11.558286
## 3       3     10.864103        0.4420192   0.4761218 0.9984547 10.610791
##         pH Log_residual_sugar Log_chlorides Log_free_sulfur_dioxide
## 1 3.316846          0.8186424     -2.434321                2.701835
## 2 3.398239          0.8225008     -2.642422                2.513108
## 3 3.189583          0.9650535     -2.460537                2.270269
##   Log_total_sulfur_dioxide Log_sulphates
## 1                 3.861476    -0.4820557
## 2                 3.374945    -0.4497846
## 3                 3.353542    -0.3855143
km$size
## [1] 615 352 312
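
We cross-tabulate the clusters against the original quality scores of the training wines:
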
table(km$cluster, train$quality)
##    
##       3   4   5   6   7   8
##   1   2  16 393 188  15   1
##   2   3  18  71 178  74   8
##   3   2   4  88 147  67   4
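
As an additional criterion for the number of clusters we apply Hartigan's rule with FitKMeans() from the useful package; a Hartigan value above 10 suggests that adding another cluster is worthwhile:
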
wholesaleBest = FitKMeans(train_kmeans, max.clusters = 10, nstart = 25, seed = 666)
wholesaleBest
##   Clusters  Hartigan AddCluster
## 1        2 783.50650       TRUE
## 2        3 332.10622       TRUE
## 3        4 198.08116       TRUE
## 4        5 178.29780       TRUE
## 5        6 153.88342       TRUE
## 6        7  88.19689       TRUE
## 7        8  85.66108       TRUE
## 8        9  87.23972       TRUE
## 9       10  74.03035       TRUE
PlotHartigan(wholesaleBest)

11.2. Hierarchical clustering
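
A minimal sketch of agglomerative clustering on the same predictors used for K-means (assuming the train_kmeans table from above):

#hierarchical clustering with Ward linkage on Euclidean distances
d_train <- dist(train_kmeans)
hc <- hclust(d_train, method = "ward.D2")
plot(hc, labels = FALSE)        #dendrogram
grupos_hc <- cutree(hc, k = 3)  #cut into 3 groups, as with K-means
table(grupos_hc)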

11.3. DBSCAN/HDBSCAN

11.4. Mixture of Gaussians

11.5. Spectral clustering
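
The kernlab package loaded at the start provides spectral clustering through specc(); a minimal sketch on the same data (assuming train_kmeans from above; specc can be slow on data sets of this size):

#spectral clustering with an RBF kernel into 3 clusters
sc <- specc(as.matrix(train_kmeans), centers = 3)
table(sc)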

12. GAM - Generalised Additive Model

12.1. Linear GAM

In this first fit we keep the continuous response variable quality, wrapping each predictor in a smooth term s():

gam_mod <- gam(quality ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) + 
    s(Log_chlorides) + s(pH) + s(Log_total_sulfur_dioxide) + s(citric_acid) + 
    s(fixed_acidity), data = train, method = "REML")
gam_mod
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## quality ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) + 
##     s(Log_chlorides) + s(pH) + s(Log_total_sulfur_dioxide) + 
##     s(citric_acid) + s(fixed_acidity)
## 
## Estimated degrees of freedom:
## 3.80 1.00 3.69 1.00 2.25 2.94 1.99 
## 2.55  total = 20.23 
## 
## REML score: 1221.011
coef(gam_mod)
##                   (Intercept)                  s(alcohol).1 
##                  5.634871e+00                  1.648020e-01 
##                  s(alcohol).2                  s(alcohol).3 
##                 -1.165238e-01                  7.041699e-02 
##                  s(alcohol).4                  s(alcohol).5 
##                 -9.552495e-02                 -4.469214e-02 
##                  s(alcohol).6                  s(alcohol).7 
##                  1.131493e-01                 -4.810090e-02 
##                  s(alcohol).8                  s(alcohol).9 
##                  6.033374e-01                  8.274343e-02 
##         s(volatile_acidity).1         s(volatile_acidity).2 
##                 -4.570556e-05                  1.038005e-06 
##         s(volatile_acidity).3         s(volatile_acidity).4 
##                 -1.351067e-05                 -1.793403e-05 
##         s(volatile_acidity).5         s(volatile_acidity).6 
##                  1.458357e-05                  1.751913e-05 
##         s(volatile_acidity).7         s(volatile_acidity).8 
##                 -1.271810e-05                  1.597256e-04 
##         s(volatile_acidity).9            s(Log_sulphates).1 
##                 -1.701127e-01                 -7.829302e-03 
##            s(Log_sulphates).2            s(Log_sulphates).3 
##                 -7.099365e-02                  4.537419e-02 
##            s(Log_sulphates).4            s(Log_sulphates).5 
##                 -3.268186e-02                 -4.713404e-02 
##            s(Log_sulphates).6            s(Log_sulphates).7 
##                  1.907383e-02                  1.687056e-02 
##            s(Log_sulphates).8            s(Log_sulphates).9 
##                  1.915352e-01                  1.288852e-01 
##            s(Log_chlorides).1            s(Log_chlorides).2 
##                  8.907392e-06                 -3.929104e-07 
##            s(Log_chlorides).3            s(Log_chlorides).4 
##                 -4.775713e-06                 -2.441926e-06 
##            s(Log_chlorides).5            s(Log_chlorides).6 
##                  2.867426e-06                  5.221761e-07 
##            s(Log_chlorides).7            s(Log_chlorides).8 
##                 -2.957613e-06                 -1.322006e-05 
##            s(Log_chlorides).9                       s(pH).1 
##                 -6.951358e-02                 -9.276906e-03 
##                       s(pH).2                       s(pH).3 
##                  1.043393e-02                 -9.687645e-03 
##                       s(pH).4                       s(pH).5 
##                 -1.755619e-02                  2.316811e-03 
##                       s(pH).6                       s(pH).7 
##                  1.786883e-02                  9.528146e-03 
##                       s(pH).8                       s(pH).9 
##                  1.380056e-01                 -9.893089e-02 
## s(Log_total_sulfur_dioxide).1 s(Log_total_sulfur_dioxide).2 
##                  1.264371e-01                 -7.861424e-03 
## s(Log_total_sulfur_dioxide).3 s(Log_total_sulfur_dioxide).4 
##                  3.959153e-02                 -2.883118e-02 
## s(Log_total_sulfur_dioxide).5 s(Log_total_sulfur_dioxide).6 
##                 -3.644112e-02                  3.414453e-02 
## s(Log_total_sulfur_dioxide).7 s(Log_total_sulfur_dioxide).8 
##                  2.681938e-02                  1.658594e-01 
## s(Log_total_sulfur_dioxide).9              s(citric_acid).1 
##                  4.615124e-02                 -8.294107e-03 
##              s(citric_acid).2              s(citric_acid).3 
##                  2.156461e-02                 -8.422364e-03 
##              s(citric_acid).4              s(citric_acid).5 
##                  1.810441e-02                  7.342591e-03 
##              s(citric_acid).6              s(citric_acid).7 
##                 -2.160911e-02                  7.256404e-03 
##              s(citric_acid).8              s(citric_acid).9 
##                 -1.070232e-01                 -3.440155e-02 
##            s(fixed_acidity).1            s(fixed_acidity).2 
##                 -3.380473e-03                 -4.880922e-02 
##            s(fixed_acidity).3            s(fixed_acidity).4 
##                 -2.163905e-02                 -4.395839e-02 
##            s(fixed_acidity).5            s(fixed_acidity).6 
##                  1.957378e-02                 -4.981481e-02 
##            s(fixed_acidity).7            s(fixed_acidity).8 
##                 -2.147517e-02                  2.926902e-01 
##            s(fixed_acidity).9 
##                 -3.132917e-02
summary(gam_mod)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## quality ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) + 
##     s(Log_chlorides) + s(pH) + s(Log_total_sulfur_dioxide) + 
##     s(citric_acid) + s(fixed_acidity)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.63487    0.01713   328.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                               edf Ref.df      F  p-value    
## s(alcohol)                  3.799  4.762 54.064  < 2e-16 ***
## s(volatile_acidity)         1.003  1.006 54.789  < 2e-16 ***
## s(Log_sulphates)            3.691  4.634 23.398  < 2e-16 ***
## s(Log_chlorides)            1.000  1.001 11.922 0.000572 ***
## s(pH)                       2.252  2.906  5.016 0.001882 ** 
## s(Log_total_sulfur_dioxide) 2.943  3.724  3.550 0.008809 ** 
## s(citric_acid)              1.991  2.508  2.440 0.062463 .  
## s(fixed_acidity)            2.553  3.260  1.909 0.134935    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.404   Deviance explained = 41.3%
## -REML =   1221  Scale est. = 0.37532   n = 1279
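
We refit the model keeping only the three smooth terms with the strongest evidence (alcohol, volatile_acidity and Log_sulphates, all with p-values below 2e-16):
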
gam_mod2 = gam(quality ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates),
    data = train, method = "REML")

summary(gam_mod2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## quality ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.63487    0.01749   322.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                       edf Ref.df     F p-value    
## s(alcohol)          3.958  4.951 60.74  <2e-16 ***
## s(volatile_acidity) 1.001  1.002 88.52  <2e-16 ***
## s(Log_sulphates)    4.191  5.202 19.72  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.379   Deviance explained = 38.3%
## -REML = 1231.4  Scale est. = 0.39111   n = 1279
plot(gam_mod2, residuals = TRUE, pch = 1, shade = TRUE, shade.col = "lightblue")
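
We run gam.check() to confirm convergence and to verify that the basis dimension k of each smooth is large enough; a k-index below 1 together with a small p-value would suggest increasing k: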

gam.check(gam_mod2)

## 
## Method: REML   Optimizer: outer newton
## full convergence after 6 iterations.
## Gradient range [-2.529408e-05,1.537895e-06]
## (score 1231.431 & scale 0.3911066).
## Hessian positive definite, eigenvalue range [2.519977e-05,637.5075].
## Model rank =  28 / 28 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                       k'  edf k-index p-value  
## s(alcohol)          9.00 3.96    1.03   0.845  
## s(volatile_acidity) 9.00 1.00    0.96   0.065 .
## s(Log_sulphates)    9.00 4.19    0.97   0.140  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

12.2. Logistic GAM

Reference: https://noamross.github.io/gams-in-r-course/chapter4.

First of all, we create the binary variable “nota_vino”, which, based on “quality”, flags the wines with passing scores (quality >= 6, coded as “1”) or failing scores (quality < 6, coded as “0”).

train_gam <- train %>%
    mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
    mutate(quality = NULL)
train_gam
## # A tibble: 1,279 x 12
##    fixed_acidity volatile_acidity citric_acid density alcohol    pH
##            <dbl>            <dbl>       <dbl>   <dbl>   <dbl> <dbl>
##  1           7.1            0.48         0.28   0.997    10.3  3.24
##  2           7.6            0.49         0.33   0.997     9    3.3 
##  3           5              1.02         0.04   0.994    10.5  3.75
##  4           7.6            0.43         0.29   0.997     9.5  3.4 
##  5           6.8            0.59         0.1    0.996     9.7  3.3 
##  6           6.8            0.815        0      0.995     9.8  3.3 
##  7           8.5            0.21         0.52   0.996    10.4  3.36
##  8           7.4            0.36         0.29   0.996    11    3.3 
##  9           5.5            0.49         0.03   0.991    14    3.3 
## 10           6.8            0.49         0.22   0.994    11.3  3.41
## # ... with 1,269 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## #   Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## #   Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(train_gam$nota_vino)
## 
##   0   1 
## 597 682

To apply a logistic GAM to our problem we use the mgcv package and family = binomial, which tells the gam() function that our response variable is 0 or 1, that is, good wine or bad wine. The predictors are wrapped in the function s(), which specifies that we want the relationship to be flexible (a smooth term).

gam_mod_log = gam(nota_vino ~ s(alcohol) + s(volatile_acidity) +
    s(Log_sulphates) + s(Log_chlorides) + s(pH) + s(Log_total_sulfur_dioxide) +
    s(citric_acid) + s(fixed_acidity), data = train_gam, method = "REML",
    family = binomial)

summary(gam_mod_log)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## nota_vino ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) + 
##     s(Log_chlorides) + s(pH) + s(Log_total_sulfur_dioxide) + 
##     s(citric_acid) + s(fixed_acidity)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.30934    0.07799   3.966  7.3e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                               edf Ref.df  Chi.sq  p-value    
## s(alcohol)                  5.846  6.974 143.805  < 2e-16 ***
## s(volatile_acidity)         1.855  2.377  27.641 3.25e-06 ***
## s(Log_sulphates)            2.484  3.169  61.194  < 2e-16 ***
## s(Log_chlorides)            6.015  7.183  18.378   0.0121 *  
## s(pH)                       2.416  3.119   6.121   0.1115    
## s(Log_total_sulfur_dioxide) 3.741  4.706  28.623 2.87e-05 ***
## s(citric_acid)              1.490  1.828   6.633   0.0191 *  
## s(fixed_acidity)            2.535  3.226   6.556   0.1200    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.371   Deviance explained = 32.2%
## -REML = 642.13  Scale est. = 1         n = 1279

We keep only the most significant variables (those marked with three stars, ***).

gam_mod_log2 = gam(nota_vino ~ s(alcohol) + s(volatile_acidity) +
    s(Log_sulphates) + s(Log_total_sulfur_dioxide), data = train_gam,
    method = "REML", family = binomial)

summary(gam_mod_log2)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## nota_vino ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) + 
##     s(Log_total_sulfur_dioxide)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.26735    0.07471   3.579 0.000345 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                               edf Ref.df Chi.sq  p-value    
## s(alcohol)                  5.927  7.058 150.80  < 2e-16 ***
## s(volatile_acidity)         1.049  1.095  34.16  < 2e-16 ***
## s(Log_sulphates)            4.078  5.044  54.78  < 2e-16 ***
## s(Log_total_sulfur_dioxide) 3.557  4.487  35.85 5.63e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.346   Deviance explained = 29.4%
## -REML = 649.89  Scale est. = 1         n = 1279

Our intercept is 0.26735, and we use the plogis() function to transform it into a probability.

plogis(coef(gam_mod_log2)[1])
## (Intercept) 
##    0.566441

This value means that our model predicts a baseline probability of roughly 57% that a wine is good.

plot(gam_mod_log2, residuals = TRUE, pch = 1, shade = TRUE, shade.col = "lightblue", trans = plogis, shift = coef(gam_mod_log2)[1], 
     seWithMean = TRUE, col = "purple")

#alternative ways to obtain the predicted probabilities:
#predict(gam_mod_log2, type = "response", se.fit = TRUE)
#plogis(predict(gam_mod_log2, type = "link"))

We repeat the construction of the binary response variable on the test set:

test_gam <- test %>%
    mutate(nota_vino = case_when(quality >= 6 ~ 1, TRUE ~ 0)) %>%
    mutate(quality = NULL)

test_gam
## # A tibble: 320 x 12
##    fixed_acidity volatile_acidity citric_acid density alcohol    pH
##            <dbl>            <dbl>       <dbl>   <dbl>   <dbl> <dbl>
##  1           7.4             0.7         0      0.998     9.4  3.51
##  2           7.3             0.65        0      0.995    10    3.39
##  3           8.9             0.22        0.48   0.997     9.4  3.39
##  4           7.6             0.41        0.24   0.996     9.5  3.28
##  5           7.1             0.71        0      0.997     9.4  3.47
##  6           5.7             1.13        0.09   0.994     9.8  3.5 
##  7           7.3             0.45        0.36   0.998    10.5  3.33
##  8           8.1             0.66        0.22   0.997    10.3  3.3 
##  9           6.8             0.67        0.02   0.996     9.5  3.48
## 10           5.6             0.31        0.37   0.995     9.2  3.32
## # ... with 310 more rows, and 6 more variables: Log_residual_sugar <dbl>,
## #   Log_chlorides <dbl>, Log_free_sulfur_dioxide <dbl>,
## #   Log_total_sulfur_dioxide <dbl>, Log_sulphates <dbl>, nota_vino <dbl>
table(test_gam$nota_vino)
## 
##   0   1 
## 147 173

To interpret the predictors, we decompose the model output into the contribution of each smooth term:

head(predict(gam_mod_log2, type = "terms"))
##     s(alcohol) s(volatile_acidity) s(Log_sulphates) s(Log_total_sulfur_dioxide)
## 1 -0.153936041          0.11651423      -0.54774301                  0.15593321
## 2 -0.891143170          0.09074145      -0.24318835                 -0.49397254
## 3  0.007793508         -1.25421557      -0.01908907                 -0.49397254
## 4 -1.063371315          0.24564085       0.09246652                 -0.13861017
## 5 -0.910318840         -0.16603197       0.25861731                  0.09886004
## 6 -0.792699732         -0.73698784      -0.68971449                  0.37393429
predict(gam_mod_log2, type = "terms")[1, ]
##                  s(alcohol)         s(volatile_acidity) 
##                  -0.1539360                   0.1165142 
##            s(Log_sulphates) s(Log_total_sulfur_dioxide) 
##                  -0.5477430                   0.1559332
plogis(sum(predict(gam_mod_log2, type = "terms")) + coef(gam_mod_log2)[1])
## (Intercept) 
##    0.566441
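
Note that mgcv centers each smooth term (a sum-to-zero constraint over the training data), so the term contributions summed over all rows are approximately zero; this is why the value above coincides with plogis applied to the intercept. To explain a single observation we would sum only its row; a sketch with the same objects:

#predicted probability for the first wine from its per-term contributions
terms_fila1 <- predict(gam_mod_log2, type = "terms")[1, ]
plogis(sum(terms_fila1) + coef(gam_mod_log2)[1])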

12.2.1. Logistic GAM on the test data
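
We refit the reduced logistic GAM on the 320 test observations to see whether the same smooth terms remain significant: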

gam_mod_log2_test = gam(nota_vino ~ s(alcohol) + s(volatile_acidity) +
    s(Log_sulphates) + s(Log_total_sulfur_dioxide), data = test_gam,
    method = "REML", family = binomial)

summary(gam_mod_log2_test)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## nota_vino ~ s(alcohol) + s(volatile_acidity) + s(Log_sulphates) + 
##     s(Log_total_sulfur_dioxide)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   0.2339     0.1332   1.757   0.0789 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                               edf Ref.df Chi.sq  p-value    
## s(alcohol)                  1.000  1.000 25.605 7.28e-07 ***
## s(volatile_acidity)         1.000  1.000 18.279 1.93e-05 ***
## s(Log_sulphates)            2.309  2.948  7.506   0.0627 .  
## s(Log_total_sulfur_dioxide) 2.595  3.265 10.146   0.0231 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.257   Deviance explained = 22.3%
## -REML = 180.54  Scale est. = 1         n = 320
plogis(coef(gam_mod_log2_test)[1])
## (Intercept) 
##   0.5582201
plot(gam_mod_log2_test, residuals = TRUE, pch = 1, shade = TRUE, shade.col = "lightblue", trans = plogis, shift = coef(gam_mod_log2_test)[1], 
     seWithMean = TRUE, col = "purple")
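
As a complementary check, instead of refitting we can score the model trained on the training set directly on the test observations; a sketch assuming gam_mod_log2 and test_gam from above:

#predicted probabilities on test and the implied classification accuracy
p_test <- predict(gam_mod_log2, newdata = test_gam, type = "response")
mean(as.numeric(p_test > 0.5) == test_gam$nota_vino)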

13. General conclusions of the study

  • Multiple linear regression: we see a lack of adequacy and fit in the multiple linear regression model obtained. Its residuals display heteroscedasticity (non-constant variance, violating the homoscedasticity assumption), and the model does not predict the response (dependent) variable adequately from the explanatory (independent) variables. Since the dependent variable “quality” is discrete, an ordinary linear regression model does not fit our data.

  • Dimensionality reduction (PCA and t-SNE):

  • Logistic regression:

  • KNN:

  • Decision Tree:

  • Ensemble methods:

  • Random Forest:

  • Hyperparameter tuning:

  • Clustering (K-means, hierarchical and HDBSCAN):

  • GAM: